Advanced Science. 2023 Aug 10;10(28):2303496. doi: 10.1002/advs.202303496

Dissecting the Determinants of Domain Insertion Tolerance and Allostery in Proteins

Jan Mathony 1,2,4, Sabine Aschenbrenner 4, Philipp Becker 1,2,3, Dominik Niopek 4
PMCID: PMC10558690  PMID: 37562980

Abstract

Domain insertion engineering is a promising approach to recombine the functions of evolutionarily unrelated proteins. Insertion of light‐switchable receptor domains into a selected effector protein, for instance, can yield allosteric effectors with light‐dependent activity. However, the parameters that determine domain insertion tolerance and allostery are poorly understood. Here, an unbiased screen is used to systematically assess the domain insertion permissibility of several evolutionarily unrelated proteins. Training machine learning models on the resulting data allows the dissection of features informative for domain insertion tolerance and reveals sequence conservation statistics as the strongest indicators of suitable insertion sites. Finally, extending the experimental pipeline toward the identification of switchable hybrids results in opto‐chemogenetic derivatives of the transcription factor AraC that function as single‐protein Boolean logic gates. The study reveals determinants of domain insertion tolerance and yields multimodally switchable proteins with unique functional properties.

Keywords: allostery, domain insertion, optogenetics, protein engineering


The artificial recombination of protein domains is an engineering approach that can yield hybrid proteins with new functionality. However, the parameters that determine domain insertion tolerance are poorly understood. Using an unbiased domain insertion screening strategy followed by statistical‐ and machine learning‐based analysis, the authors identified indicators of suitable insertion sites and engineered novel, optogenetic bacterial transcription activators.


1. Introduction

The recombination of protein domains is an important driver of evolution. It allows nature to repeatedly build on the same set of stable protein folds and their corresponding functions, while enabling evolutionary innovation by exploring novel combinations and interdependencies thereof.[ 1 , 2 ] This observation has inspired protein engineering approaches that combine evolutionarily unrelated protein domains into single polypeptide chains, thereby creating hybrid proteins with new‐to‐nature properties.[ 3 , 4 , 5 , 6 , 7 ] From a synthetic biology perspective, a particularly interesting strategy is the insertion of receptor domains into effector proteins with the aim of allosterically coupling the effector conformation to the receptor state.[ 3 , 5 , 8 ] Receptor activation, e.g., via chemicals or light, induces an allosteric signal that is relayed to the effector's active site (e.g., a catalytic surface or binding site), thereby enabling highly targeted control of the effector‐mediated cellular function.

Although a number of hybrid proteins have been created by domain insertion engineering over the past years, their rational design remains challenging and screening of larger libraries and iterative optimization is commonly required to obtain functional hybrids.[ 9 , 10 , 11 , 12 , 13 ] Importantly, the identification of an insertion site at which the fusion of two protein domains results in their functional coupling and does not irreversibly interfere with the activity of either protein part represents a largely unsolved problem. These persisting challenges can be explained by our limited understanding of the structural and biophysical requirements and constraints that generally determine suitable domain insertion sites.

Advances in the generation of comprehensive domain insertion libraries via transposon‐[ 12 , 14 ] or oligonucleotide pool‐based cloning, [ 15 ] as well as the coupling of fluorescence‐activated cell sorting (FACS) to deep sequencing, facilitate the efficient generation and subsequent investigation of larger domain insertion datasets.[ 11 , 12 , 16 ] Employing such experimental approaches, recent studies investigated the impact of domain insertion on the membrane localization of potassium ion channels.[ 16 , 17 ] Using the resulting data to train random forest models, the authors analyzed biophysical properties that contribute to domain insertion permissibility in ion channels.[ 17 ] This previous research was centered around a single type of membrane protein as well as the impact of domain insertion on subcellular protein localization. To render domain insertion engineering a broadly‐applicable strategy, however, studying the domain insertion tolerance at the functional level as well as deciphering the determinants of functional coupling between re‐combined protein domains will be essential.

Here, we set out to broaden the understanding of domain insertion requirements in diverse protein classes. Toward this goal, we inserted up to five structurally and functionally unrelated domains into several different, unrelated candidate effector proteins covering nearly all possible sequence positions. Using gene circuits that relay effector activity to a fluorescent readout, the resulting, comprehensive libraries of protein hybrids were screened for active variants by FACS and subsequent next‐generation sequencing (NGS). Training of machine learning models on the resulting datasets allowed us to dissect parameters that affect domain insertion tolerance and revealed sequence conservation statistics as the most powerful predictors for domain insertion success. Finally, extending our experimental pipeline toward the screening of engineered, switchable effector variants yielded two potent optogenetic derivatives of the E. coli transcription factor AraC that function as single‐protein chemo‐optogenetic Boolean logic gates.

2. Results

2.1. A functional FACS‐NGS Screen of Domain Insertion Tolerance

To elucidate the domain insertion tolerance within an evolutionarily and functionally diverse set of effector proteins, we first constructed comprehensive insertion libraries. The libraries comprised effector proteins carrying insert domains at all possible sequence positions (Figure 1A). Four structurally unrelated proteins that are widely applied in synthetic and cell biology were chosen as effector protein scaffolds: the transcription factor AraC, the recombinase Flp, a previously described variant of the TVMV protease,[ 18 ] and σ‐factor F (SigF) from Bacillus subtilis (Figure 1B). Protein hybrid libraries were generated via saturated programmable insertion engineering (SPINE) for all four candidates using the PDZ domain from murine α1‐syntrophin as insert (Figure 1B).[ 15 ] With its small size of 86 amino acids, its globular fold and the N‐ and C‐terminus located in close proximity (∼10 Å), the PDZ domain is ideally suited for domain insertion screening (Table S1, Supporting Information).[ 11 ] Further, to elucidate how the domain identity would affect the functionality of the resulting protein hybrids, four additional insert domains of varying size and structure (see Table S1, Supporting Information for details) were selected and inserted at all possible sequence positions into one of the candidate proteins, AraC. These included the AsLOV2 (Avena sativa) domain, the estradiol binding domain from human estrogen receptor‐α (ERD), an enhanced yellow fluorescent protein (eYFP)[ 19 ] and the synthetic rapamycin receptor uniRapR.[ 20 ] Following the construction of all eight libraries, a nearly complete coverage of all possible insertion sites was observed by deep sequencing (Figure S1, Supporting Information).

Figure 1.


Domain insertion profiling of functionally and structurally diverse proteins. A) Flow chart of the domain insertion screening workflow. B) Overview of the screened PDZ‐domain insertion libraries. The depicted structures of the parent proteins are AF2 predictions. PDB‐ID of PDZ: 1Z86. C) Enrichment score histograms for the different candidate proteins are shown. The Log2 norm. read counts correspond to the fraction of reads after enrichment normalized to the fraction of read counts within the initial library. Data from the four candidate proteins AraC, Flp, TVMV protease, and SigF with PDZ domain inserts are shown. Enrichments are mapped to the respective insertion site as indicated by the position of the acceptor proteins preceding the insertion. Light green, dark green: individual replicates. Grey: variants with zero reads after enrichment. Red: variants missing in the initial library. Insertion sites correspond to residues preceding the inserted domain.

To enable functional screening of these libraries in Escherichia coli, we next created reporter gene circuits that robustly couple the activity of the effector protein to the expression or stability of a red fluorescent protein (RFP) (Figure S2A, Supporting Information, Methods). We then co‐transformed E. coli Top10 cells with the reporters and their corresponding effector‐insert hybrid libraries, followed by an analysis of the reporter activity via FACS. Fluorescence histograms of the initial libraries showed a large fraction of non‐functional hybrid protein candidates as indicated by a large proportion of non‐ or low fluorescent cells (Figure S2B, Supporting Information). Still, a small but considerable fraction corresponding to medium to high fluorescent cells and hence active protein hybrids was observed. Sorting this fraction resulted in a clear enrichment of cells expressing high RFP levels in the case of AraC and SigF and less pronounced, but still visible enrichments in fluorescent cells for Flp and the TVMV protease (Figure S2C, Supporting Information). Quantitative differences between the four effector library pools were caused by varying proportions of active versus inactive hybrid protein candidates in the initial libraries as well as differences in the dynamic range of the reporter assays (Figure S2, Supporting Information, controls). To ensure a significant enrichment of active variants, we sorted each library in two consecutive rounds. Next, we assessed enrichment or depletion of each individual domain insertion variant in the sorted libraries by adapting the previously published DIP‐seq pipeline.[ 12 ] In short, the fraction of read counts corresponding to a variant after enrichment was normalized by the fraction of read counts from the initial library and the resulting scores were log2‐scaled. 
Variants that went extinct during sorting and thus had a read count of zero were assigned a log2 value of −10, since this represents the assay's detection limit. To ensure the reproducibility of the workflow, the whole screening and sequencing process was performed in two independent replicates.
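The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration and not the DIP‐seq pipeline itself: the function name, the site labels, and the read counts are invented for the example, and only the normalization (fraction after sorting divided by fraction in the initial library, log2‐scaled) and the floor of −10 for variants with zero reads after sorting are taken from the text. Variants missing from the initial library are assumed here to receive no score.

```python
import math

# Log2 value assigned to variants with zero reads after sorting (taken from the text).
DETECTION_FLOOR = -10.0


def enrichment_scores(initial_counts, sorted_counts):
    """Log2 ratio of a variant's read fraction after sorting to its fraction before.

    Variants with no reads in the initial library get None (assumption of this
    sketch); variants with no reads after sorting get DETECTION_FLOOR.
    """
    total_initial = sum(initial_counts.values())
    total_sorted = sum(sorted_counts.values())
    scores = {}
    for site, n_initial in initial_counts.items():
        n_sorted = sorted_counts.get(site, 0)
        if n_initial == 0:
            scores[site] = None
        elif n_sorted == 0:
            scores[site] = DETECTION_FLOOR
        else:
            fraction_sorted = n_sorted / total_sorted
            fraction_initial = n_initial / total_initial
            scores[site] = math.log2(fraction_sorted / fraction_initial)
    return scores


# Made-up read counts for four hypothetical insertion sites.
initial = {"site_1": 100, "site_2": 100, "site_3": 200, "site_4": 0}
after_sort = {"site_1": 300, "site_2": 100, "site_3": 0}
scores = enrichment_scores(initial, after_sort)
```

In this toy example, site_1 triples its share of the reads and scores log2(3), site_2 keeps its share and scores 0, site_3 drops out and is set to −10, and site_4 cannot be scored.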

Results from different replicates correlated well, with a Pearson correlation coefficient (Pearson's r) > 0.8 in all cases except one (Pearson's r for TVMV‐PDZ = 0.65), while the level of enrichment/depletion differed between replicates for individual variants (Figure S3, Supporting Information). As cross‐validation of our enrichment and analysis pipeline, we experimentally measured the activity (RFP expression) for a set of hybrids individually and compared it to the variant enrichment scores obtained by NGS. As expected, a drastic difference in activity between the enriched and the depleted variants was measured in most cases (Figure S4, Supporting Information). For the following analysis, the mean of the two biological replicates was used.

2.2. Domain Insertion Permissibility is Sequentially and Structurally Clustered

Mapping the enrichment scores of the PDZ insertion libraries to the amino acid sequences of the four respective effector proteins revealed that positions tolerating insertions occurred in clusters spanning regions of ≈10–30 consecutive amino acids (Figure 1C). Insertion tolerance thus appears to be regionally confined, rather than being determined by features of individual residues or positions. Roughly 80% of the insertions within each protein were depleted, i.e., these positions did not tolerate domain fusion (Figure 1C).

Moreover, the number of clusters with enrichments differed substantially between the insert domains tested in combination with the AraC effector (Figure S5, Supporting Information). For the LOV2 insert domain, we observed several insertion‐permissive regions throughout the sequence of AraC comparable to those for the PDZ insert. In contrast, the other three insert domains were enriched at substantially fewer positions, mainly at the C‐terminus of AraC. As LOV2 and PDZ are considerably smaller (<150 AA) than the other tested domains, insert size appears to be a determining factor for insertion tolerance. In addition, the distances between the termini of the PDZ and LOV2 domains (14.1 Å and 20.7 Å, respectively; measured between the terminal residues in the structures listed in Table S4, Supporting Information) are smaller than those of the other insert domains, although the distance between the N‐ and C‐termini of uniRapR is only marginally larger (24.4 Å) (Table S1, Supporting Information). Interestingly, we hardly observed insertion sites selective for just one specific insert domain. This indicates that domain insertion permissibility is a general property of protein regions rather than a lock‐and‐key relation between an insertion site and an individual insert domain.

Next, we mapped the enrichment scores onto structures of the respective effector proteins. To this end, we used AlphaFold2 (AF2)‐predicted protein structures, as well as experimentally resolved full‐length structures if available[ 21 , 22 ] (Figure 2A–D; Figures S6–S8, Supporting Information). Importantly, the predicted structures were generally in excellent agreement with the available experimentally validated (partial) folds (Figure S9, Supporting Information). Structural analysis revealed strong depletions around functionally critical regions, such as the DNA‐ and arabinose‐binding sites of AraC, the catalytic center of the Flp recombinase, or the DNA‐binding region of SigF (Figure 2A–D; Figures S6 and S7, Supporting Information). For TVMV protease, depletions within the hydrophobic core and around the active site were observed, although trends were overall less pronounced for this candidate protein (Figure 2C; Figure S6C, Supporting Information). Interestingly and in contrast to common assumptions underlying domain insertion engineering strategies, no clear enrichment at surface‐exposed unstructured loops could be identified for any of the candidates. Rather, insertion sites were observed at similar frequency in helices, sheets, and loops (Figure 2A–D).

Figure 2.


Secondary structure and amino acid features alone do not explain the experimentally observed domain insertion patterns. A–D) Domain insertion permissive positions are clustered at diverse, locally confined surface sites. The insertion scores from the PDZ libraries are mapped onto the AF2 structure predictions of AraC A) and Flp recombinase B), the crystal structure of the TVMV protease (PDB‐ID: 3MMG) C), and an AF2 structure prediction of SigF D). Functionally critical residues of AraC, Flp, and the TVMV protease are indicated in grey. E) Correlation between variant enrichment and the average surface exposed area (ASA) of the residues neighboring an insertion site, plotted for AraC‐PDZ. Spearman's r is indicated. F) Violin plot of the insertion score distribution with respect to different secondary structure elements for the AraC‐PDZ insertion library. For each insertion site, the secondary structure assignment of the amino acids prior to and after the insertion was considered. The IQR is marked by the box and the median is represented by a white dot. Whiskers extend to the 1.5‐fold IQR or to the value of the smallest or largest enrichment, respectively. G) Spearman correlations between all datasets and diverse positional features are shown (Table S2, Supporting Information). Linker idx: Different amino acid specific linker propensity indices that were reported by the indicated authors.

Next, to quantitatively analyze these qualitative observations, we correlated the measured enrichments with a set of basic positional properties such as the average solvent accessible area (ASA), secondary structure, and amino acid identity of the residues neighboring a respective insertion site (Figure 2E,F; Figures S10 and S11, Supporting Information). Of note, none of these basic properties explained the observed enrichments. In order to obtain a more comprehensive overview of protein features that could affect domain insertion success, a larger set of position‐specific features was gathered (Table S2, Supporting Information, Methods). These comprised a number of biophysical amino acid properties, fetched from the “AAindex” database,[ 23 , 24 ] as well as several previously published linker propensity indices.[ 25 , 26 , 27 ] These indices describe to which extent amino acids tend to be present in inter‐domain linkers. Regions with high linker propensities are commonly expected to be well suited for the insertion of domains. Further, we included the pLDDT confidence score from AF2 models, which was previously shown to correlate with intrinsically disordered sites.[ 28 ] Moreover, the Kullback‐Leibler divergence (KLD), a measure for sequence conservation, was extracted from multiple sequence alignments of the candidate protein with natural homologs. Finally, additional scores, such as the frequency of insertions and deletions at every position in evolutionarily related sequences, were included (refer to Methods). Spearman correlations between all enrichment scores for the screened libraries and each feature revealed overall weak trends, with the majority of the correlation coefficients lying in the range between −0.2 and 0.2 (Figure 2G; Figure S12, Supporting Information). 
This observation is in agreement with previous results in the context of ion channels.[ 16 , 17 ] Additionally, we confirmed that AF2‐based structure predictions of insertion variants could not explain the observed enrichment trends (Note S1 and Figures S13 and S14, Supporting Information).
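For readers unfamiliar with the statistic, the Spearman correlation used above is the Pearson correlation of the rank‐transformed values, with tied values sharing the mean of their ranks. The following standard‐library Python sketch shows the computation. The function names and toy numbers are illustrative only, and in practice an established statistics library would normally be used.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values: a monotonic but non-linear relation still gives a coefficient of 1.
feature = [0.1, 0.4, 0.5, 0.9]
enrichment = [-10.0, -2.0, 0.5, 3.0]
rho = spearman(feature, enrichment)
```

Because only ranks enter the calculation, the floor value of −10 assigned to extinct variants affects the coefficient through ties rather than through its magnitude.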

2.3. Machine Learning Reveals Statistical Features Predicting Domain Insertion Tolerance

The absence of any clear correlation between the experimental data and positional protein properties raised the question of whether a combination of the above features would enable the prediction of domain insertion tolerance. To address this question, machine learning models were trained on the entirety of the gathered insertion site properties in combination with amino acid identity and secondary structure information as additional features. The learning objective was to discriminate between enriched sites that tolerated the insertion of a domain versus depleted positions, as these states appeared to be well separated in the data (Figure 1C). As model architecture, we chose a gradient boosting classifier,[ 29 ] i.e., an algorithm that additively combines multiple simpler machine learning models (in this case basic regression trees) by minimizing a loss function. Such algorithms are known to perform particularly well on tabular datasets. The model was trained for each protein using five‐fold cross‐validation. We assessed the classifier's performance on each cross‐validation test set, using standard metrics including the area under the receiver operating characteristic (AUROC) and average precision (AP) (refer to the experimental section for details). The models reached surprisingly good performances on datasets derived from individual candidate proteins, ranging from a mean AUROC of 0.72 for SigF‐PDZ to 0.92 for Flp‐PDZ (Figure 3A,B; Figure S15, Supporting Information). The corresponding AP ranged from 0.41 (SigF‐PDZ) to 0.82 (AraC‐PDZ) (Figure 3A,B; Figure S15, Supporting Information). The lower AP values are caused by the high proportion of negative labels in the respective datasets. Encouraged by these results, we optimized the model on a complete training set including all four proteins, which resulted in a mean AUROC of 0.84 and a mean AP of 0.54 (Figure 3C,D). To place the classifier's performance into context, we compared it to several benchmarks on a previously withheld test set. These included a random choice baseline and the use of individual features as predictors. Our classifier exhibited highly improved predictive power as compared to all individual features, reaching an AUROC of 0.85 and an AP of 0.56, suggesting that the entirety of input features implicitly provided the information necessary for successful prediction of domain insertion tolerance (Figure 3E; Figure S16, Supporting Information).
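To make the two reported metrics concrete, the sketch below computes AUROC and AP from binary labels (1 = enriched, 0 = depleted) and classifier scores using only the Python standard library. It is an illustration of the metric definitions, not the evaluation code of the study. AUROC is computed as the probability that a positive site is scored above a negative one, and AP as the recall‐weighted sum of precisions over score thresholds, which is one common definition and is assumed here. The labels and scores are invented.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Sum over thresholds of (increase in recall) x (precision at that threshold)."""
    n_positive = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    ap, true_positives, previous_recall = 0.0, 0, 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Items sharing the same score are admitted together at one threshold.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            true_positives += ranked[j][1]
            j += 1
        recall = true_positives / n_positive
        precision = true_positives / j
        ap += (recall - previous_recall) * precision
        previous_recall = recall
        i = j
    return ap


# Toy example: two tolerated sites among five, one of them ranked below a depleted site.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
toy_auroc = auroc(labels, scores)
toy_ap = average_precision(labels, scores)
```

The two metrics have different chance levels, which is relevant for the class imbalance noted above: a classifier that scores sites at random has an expected AUROC of 0.5, whereas its expected AP is approximately the fraction of positive labels, i.e., around 0.2 if roughly 80% of the sites are depleted.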

Figure 3.


Gradient boosting classifier models reveal parameters informative of domain insertion tolerance. A,B) ROC curves of the model trained with fivefold cross‐validation on the AraC‐PDZ dataset A) or the combined PDZ datasets of all candidate proteins B). Results from individual cross‐validations are shown in grey and the mean ROC is depicted in red (see Experimental Section 4.13 for details on the used metrics). C,D) Precision‐recall metrics for individual cross‐validation folds are shown. The mean average precision (Mean AP) is indicated. E) The AUROC and average precision of the trained classifier and different benchmarks are shown. The metrics were assessed on a previously withheld test set. F) Bar plot indicating the Gini importance (i.e., mean decrease in impurity) of each feature for the model trained on the full dataset. G) The ROC metric of a gradient boosting model that was trained exclusively on the amino acid identities is shown. H) ROC of a model that was trained on a subset of features comprising deletion frequency, KLD, insertion frequency, mean insertion length, the linker propensity index by Suyama[ 25 ] and the pLDDT score from AF2 structure predictions. A,B,G,H) The ROC is depicted for individual folds in grey and the mean ROC in red. The mean AUC is marked in light red. Precise values are indicated.

Finally, we aimed at identifying the key features most informative for the prediction of domain insertion tolerance. To this end, the influence of individual features on the model's performance was assessed by measuring the permutation importance of each feature as well as its Gini importance (Figure 3F; Figure S17A, Supporting Information).[ 30 ] The permutation importance measures the decrease of a model's accuracy upon random shuffling of the values for an individual feature. The Gini importance, in contrast, measures the average importance of regression tree nodes corresponding to a certain feature by calculating the respective decrease in impurity. Both measures indicated that most parameters were dispensable, while the alignment‐derived properties were most critical for successful prediction. In line with this, a model trained solely on information about the identity of insertion‐adjacent amino acids reached an AUROC of 0.64 (Figure 3G). As a consequence, we removed features from the input data in a stepwise manner, while ensuring that the performance of the model did not decrease upon feature removal. Following this procedure, we were able to train a reduced model based on only six features: KLD, deletion frequency, insertion frequency, mean insertion length, pLDDT, and the linker index by Suyama et al.[ 25 ] With an AUROC of 0.87 and an AP of 0.55, the reduced model performed as well as the original one trained on all features (Figure 3H). Lastly, the feature importance analysis was repeated with the reduced model. Akin to the previous observations, KLD, insertion frequency, and deletion frequency, i.e., evolutionary and statistical features derived from MSAs, were detected as the most important parameters explaining domain insertion tolerance (Figure S17B,C, Supporting Information).
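The permutation importance described above can be illustrated with a short, self‐contained Python sketch. Several elements are assumptions made for the example and are not taken from the study: the stand‐in model simply returns the first feature as its score, the performance metric is AUROC rather than accuracy, and the data, the function names, and the number of repeats are invented.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in AUROC when one feature column is shuffled, for each feature."""
    rng = random.Random(seed)
    baseline = auroc(labels, [predict(row) for row in rows])
    importances = []
    for f in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[f] for row in rows]
            rng.shuffle(column)
            # Rebuild the rows with only feature f replaced by its shuffled values.
            shuffled = [row[:f] + [v] + row[f + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - auroc(labels, [predict(row) for row in shuffled]))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical data: feature 0 separates the classes, feature 1 is ignored by the model.
rows = [[0.9, 5], [0.8, 1], [0.7, 4], [0.6, 2], [0.4, 3], [0.3, 5], [0.2, 1], [0.1, 4]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def toy_model(row):
    """Stand-in for a trained classifier: the score is simply feature 0."""
    return row[0]


importances = permutation_importance(toy_model, rows, labels)
```

Shuffling a feature that the model does not use leaves its predictions unchanged, so its importance is exactly zero, whereas shuffling the informative feature lowers the AUROC. Note that strongly correlated features can share or mask each other's importance under this measure, which is one reason to inspect several importance measures side by side.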

2.4. Identification of Potent Light‐Switchable AraC Variants

Up to this point, we focused on features determining the preservation of function upon domain insertion into an effector protein. Taking our experimental screening approach one step further, we next investigated to which extent insertions can mediate allosteric behavior, i.e., a functional link between an insert and the effector. Such switchable hybrids are of great interest for various applications in biology and bioengineering. Toward this goal, we revisited our initial AraC‐LOV hybrid library. The AsLOV2 domain is known to reversibly unfold its two terminal helices in response to blue light (≈450 nm), a property that has been harnessed for the development of light‐switchable effector proteins in optogenetics.[ 3 , 31 ] It was hence interesting to explore whether screening our comprehensive AraC‐LOV library could readily reveal potent optogenetic AraC variants.

We, therefore, repeated the screen for the AraC‐LOV library, this time incubating the cultures under blue‐light exposure prior to FACS sorting. The resulting variant enrichment was then compared to that of the same library sorted upon incubation of cultures in the absence of light (Figure S5, Supporting Information). Globally, we observed a high similarity between the resulting enrichment scores for each position under both conditions (Figure  4A; Figure S18A, Supporting Information). However, a subset of regions showed significant differences between the enrichment scores obtained for the libraries cultured in the dark and light (Figure S18B,C, Supporting Information). Strikingly, further analyzing the insertion variants in these regions revealed a plethora of presumably light‐activatable as well as light‐inhibited AraC‐LOV hybrids corresponding to multiple different AraC insertion sites (Figure S18B,C, Supporting Information).
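One simple way to shortlist candidate switches from two such screens is to compare the per‐site enrichment scores directly. The sketch below is only an illustration of that idea: the text does not specify how significant differences were called, so the fixed threshold on the light‐minus‐dark difference, the function name, and the scores are all hypothetical.

```python
def classify_switch_candidates(dark_scores, light_scores, min_delta=2.0):
    """Label each site by the difference between its light and dark enrichment scores.

    min_delta is a hypothetical cutoff in log2 units, not a value from the study.
    """
    calls = {}
    for site in sorted(dark_scores.keys() & light_scores.keys()):
        delta = light_scores[site] - dark_scores[site]
        if delta >= min_delta:
            calls[site] = "light-ON candidate"
        elif delta <= -min_delta:
            calls[site] = "light-OFF candidate"
        else:
            calls[site] = "no clear light dependence"
    return calls


# Made-up enrichment scores for three hypothetical insertion sites.
dark = {"site_a": -8.0, "site_b": 3.0, "site_c": 1.0}
light = {"site_a": 2.5, "site_b": -10.0, "site_c": 1.4}
calls = classify_switch_candidates(dark, light)
```

A fixed cutoff ignores replicate variability, so candidates picked this way would still need individual validation, as was done for the variants characterized in the following paragraphs.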

Figure 4.


LOV2 domain insertion screening yields chemo‐optogenetic AND and NIMPLY gates. A) Scatterplot showing the relation between the enrichment scores of individual variants for the libraries incubated in the light and dark. B) Characterization of light‐responsive AraC variants. Inducers were supplied in the indicated concentrations. The samples were incubated under light exposure or in darkness, followed by measurements of reporter fluorescence (RFP) and OD600. Bars represent means from three independent replicates. Error bars show the SD. The corresponding logic gates are indicated. C) Agar photograph generated via an AraC‐S170‐LOV2 controlled RFP reporter. Top agar mixed with inducers and bacteria carrying an RFP reporter plasmid and the AraC‐S170‐LOV2 variant was plated on an agar plate, which also contained arabinose and IPTG. The plate was incubated overnight, while being illuminated through a photo‐mask of the logo on the left (without the text). D) Cultures were inoculated into media containing 400 µM IPTG and 25 mM arabinose. The samples were incubated either in darkness or under blue‐light exposure. At the beginning of the experiment and every three hours from then on, RFP fluorescence and OD600 were measured, followed by 1:30 dilution in fresh media. Points represent the mean of n = 3 biological replicates. Error bars indicate the SD. E) An AF2 prediction of the full‐length AraC (green) is shown alongside the crystal structure (grey and white) of the arabinose binding domain. The relative positioning of the structures was obtained by superimposing the AF2 model onto a dimer crystal structure. Insertion sites and key residues are highlighted and their function is indicated. PDB‐ID: 2ARA.

From this set of optogenetic variants, we chose two AraC‐LOV hybrids for further characterization, one light‐ON switch carrying the LOV2 insertion behind I113 (AraC‐I113‐LOV) and a light‐OFF switch with the LOV2 insertion behind S170 (AraC‐S170‐LOV). We then assessed the performance of these AraC‐LOV hybrids using the previously established RFP transcription reporter in E. coli under varying arabinose concentrations, as well as light conditions. Interestingly, the activity of both AraC‐LOV hybrids was co‐dependent on the arabinose concentration and the light stimulus (Figure 4B; Figure S18D, Supporting Information). The AraC‐I113‐LOV samples showed a 23‐fold increase in reporter expression upon illumination at an arabinose concentration of 4 mM. At higher arabinose concentrations, increasing fluorescence levels were also observed for samples incubated in the dark, indicating that the chemical inducer could, to some extent, override the light‐mediated regulation. Conversely, the AraC‐S170‐LOV samples showed efficient, light‐dependent repression of reporter activity practically to baseline, with a 43‐fold switch in reporter activity at 16 mM arabinose. Moreover, the light regulation was in this case not affected by high arabinose concentrations. Comparing the overall activation of the AraC variants in response to arabinose, the activity of the wildtype saturates already at an inducer concentration of 4 mM, while the LOV2 hybrids, in particular AraC‐S170‐LOV, require higher arabinose concentrations (up to 16 mM) to trigger maximum reporter activity. This suggests that LOV2 insertion weakened the sensitivity of AraC to arabinose. The observed behavior establishes the AraC‐I113‐LOV and AraC‐S170‐LOV hybrids as single‐protein Boolean logic devices capable of integrating light and arabinose as inputs and functioning as AND and NIMPLY gates, respectively (Figure 4B; Note S2, Supporting Information).
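The two gate types can be written out as truth tables. The short Python sketch below is an idealized, binary abstraction of the behavior described above, with arabinose and blue light as inputs and reporter expression as output. The function names are chosen for illustration, and the graded, concentration‐dependent responses of the actual proteins (such as the partial override at high arabinose) are not modeled.

```python
def and_gate(arabinose, light):
    """Idealized AraC-I113-LOV (light-ON): output requires arabinose AND light."""
    return arabinose and light


def nimply_gate(arabinose, light):
    """Idealized AraC-S170-LOV (light-OFF): output requires arabinose AND NOT light."""
    return arabinose and not light


# Enumerate all input combinations as (arabinose, light).
inputs = [(a, l) for a in (False, True) for l in (False, True)]
and_table = [and_gate(a, l) for a, l in inputs]
nimply_table = [nimply_gate(a, l) for a, l in inputs]

for (a, l), out_and, out_nimply in zip(inputs, and_table, nimply_table):
    print(f"arabinose={a!s:5} light={l!s:5} AND={out_and!s:5} NIMPLY={out_nimply}")
```

Each gate is active for exactly one of the four input combinations: arabinose with light for AND, and arabinose without light for NIMPLY.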

Next, we investigated if these new optogenetic AraC variants facilitate spatiotemporal control of gene expression. Growing the AraC‐S170‐LOV reporter strain on agar while illuminating it through a photo‐mask confined reporter RFP expression to light‐shielded regions and hence resulted in display of the photomask shape on the fluorescent cell layer (Figure 4C). Moreover, incubating AraC‐S170‐LOV and AraC‐I113‐LOV reporter strain cultures while alternating between light and dark conditions resulted in reporter expression oscillation, the phase of which depended on the AraC‐LOV variant used (Figure 4D). Taken together, the results showcase the versatility of this new chemo‐optogenetic toolkit with respect to spatiotemporal control of gene expression in E. coli.

On a structural level, it is striking that most insertion sites resulting in switchable AraC behavior are located within the region between the ligand‐binding domain (LBD) and the DNA‐binding domain (DBD) of AraC (Figure 4E). This trend can be explained by the functional role of this region, which serves as a dimerization interface upon AraC activation and mediates the relative flexibility of both domains.[ 32 , 33 ] It is thus no surprise that LOV2 domain insertions in this area can influence AraC function. Of note, AF2 structure predictions of AraC‐I113‐LOV and AraC‐S170‐LOV capture the former variant in a more compact conformation, which is in agreement with the less flexible repressor state of wildtype AraC[ 32 , 33 ] (note: AF2 predicts the LOV2 structure in its dark‐adapted state) (Figure S19, Supporting Information). AraC‐S170‐LOV, in turn, was predicted to have a more relaxed conformation, as would be expected for an active AraC (Figure S19, Supporting Information). To further investigate the robustness of allosteric coupling in both hybrid proteins, we screened a set of point mutants for their effects on wildtype AraC and its engineered derivatives (Figure S20, Supporting Information). The majority of mutations did not improve the AraC‐I113‐LOV switch, but rather reduced reporter activity in the active (light) state or increased leakiness, i.e., reporter activity in the dark. Excitingly, several AraC‐S170‐LOV point mutants (e.g., T50S, G141D, and V284F; mutations correspond to residues in wildtype AraC) showed an increased level of activity in the dark as compared to the initial variant, while likewise retaining potent reporter repression upon illumination. The mutations E3I and T241C, in turn, permanently impaired the function of the AraC‐S170‐LOV variant, while having no significant effect on AraC‐I113‐LOV. Finally, none of the tested mutations had major effects on the activity of wildtype AraC. 
Collectively, our data highlight (i) the variant‐specificity of mutational effects in the engineered allosteric AraC‐LOV hybrids and (ii) their increased functional and likely structural sensitivity toward minor sequence alterations. Moreover, the mutational data in conjunction with the arabinose‐dependency data (Figure 4B) indicate the interconnection of the natural arabinose‐mediated allosteric regulation with a LOV2‐induced artificial allosteric pathway.

3. Conclusion

In this study, we investigated the constraints of domain insertion engineering at the functional and structural level. Thereby, we considerably extended the existing body of work toward new protein families and, for the first time, compared the insertion tolerance of several evolutionarily unrelated proteins side‐by‐side directly using effector protein function as readout. In agreement with previous studies,[ 16 , 17 ] our data showcase the absence of any simplistic explanations for domain insertion permissibility. In contrast, we demonstrated that gradient boosting classifiers can help to decipher the importance of factors underlying domain insertion tolerance. Our models identified MSA‐derived conservation statistics as main determinants of domain insertion tolerance, thus suggesting an evolutionarily informed approach to be particularly promising for domain insertion engineering (Figure 3). In this context, parallels can be drawn with statistical coupling analysis (SCA), a method for identifying co‐evolving residues based on the statistical evaluation of MSAs.[ 5 , 13 , 34 ] The SCA‐derived residue patterns termed “protein sectors” have been proposed to be functionally critical and well suited for identifying allosteric sites to engineer protein switches.[ 13 ] In contrast, our work underscores the indicative value of evolutionary insertion/deletion events.

We note that in context of domain insertions, the predictive power of machine learning models is still constrained by the amount of available training data, which is, in turn, restricted by the current experimental capacity limits. The use of experimental data, such as the presented insertion library screens, in combination with larger datasets extracted from public protein sequence databases might provide an elegant solution to address this limitation in data size.

In addition, it will be interesting to see to what extent the observed trends are replicated across entirely unrelated protein classes, such as enzymes, which are particularly vulnerable even to minor structural changes in the active site or proteins the activity of which depends on complex domain motions. In these cases, the preservation of activity might rely on factors that cannot easily be inferred from conservation statistics.

With respect to allosteric proteins, the screening pipeline developed here was efficient in identifying allosteric switches (Figure 4; Figure S18, Supporting Information). In previous work, a GFP‐maltose‐binding protein insertion library was enriched alternately in the presence and absence of the input trigger in three consecutive rounds.[ 12 ] Our adaptation of the method, using parallel enrichment of the same library under different conditions (here culturing samples in the presence or absence of light), turned out to be sufficient to reliably identify light‐switchable proteins. A more stringent selection regime during FACS could potentially render even a single round of enrichment sufficient, which would further simplify and streamline the workflow for the engineering of switchable effector proteins.

We note that several optogenetic bacterial expression systems exist.[ 35 , 36 , 37 ] These include the light‐responsive AraC variant BLADE, which is based on the Vivid LOV domain from Neurospora crassa functioning via light‐induced AraC dimerization.[ 35 ] In contrast to these previous examples, the transcription factors developed here are co‐dependent on two stimuli, namely light and arabinose. This has interesting implications for synthetic biology applications and gene circuit control. Transcription factors co‐dependent on two inputs enable the independent control of the state (on/off) and amplitude of activation for genetic programs. Previously, chemically inducible transcription factors and light‐responsive regulators had to be combined within far more complex circuits to achieve the same goal.[ 37 , 38 , 39 ] The optogenetic variants presented here greatly simplify such experimental setups by reducing the underlying system to a single protein component (see Note S2, Supporting Information). Such single‐protein Boolean logic gates could considerably streamline the design and increase the robustness of complex genetic circuits and biocomputing programs by reducing the number of required components and through the direct integration of signals within a single molecule.

In summary, our study pinpoints determinants of domain insertion tolerance and showcases the power of unbiased domain insertion screens for engineering allosteric effector proteins with applications in synthetic biology and beyond.

4. Experimental Section

Molecular Cloning

All constructs used in this study are listed in Table S3 (Supporting Information). The corresponding amino acid sequences of the encoded proteins are shown in Table S4 (Supporting Information). Plasmids were constructed using Golden Gate assembly.[ 40 ] In brief, DNA fragments were amplified by PCR (Q5 2x Master Mix, New England Biolabs (NEB)), with primers carrying type IIS restriction enzyme recognition sites in their 5′‐overhangs, which enabled the scarless assembly of constructs. PCRs were performed according to the NEB standard protocols. For Golden Gate assembly, the procedure described by Engler et al. was followed.[ 40 ] DNA‐oligonucleotides were ordered from Merck and Integrated DNA Technologies (IDT). Double‐stranded DNA fragments were purchased from IDT. Point mutants were cloned by introducing the changes via mismatching primers upon amplification of the full plasmid and subsequent phosphorylation and ligation. PCR products were resolved on 0.5x Tris‐acetate‐EDTA (TAE) 1% agarose gels and the corresponding bands were cut out and purified using the QIAquick Gel Extraction kit (Qiagen). Restriction enzymes and T4 DNA ligase were obtained from NEB and Thermo Fisher Scientific. Following DNA assembly, Top10 E. coli cells (Thermo Fisher Scientific) were transformed with the respective construct, plated on agar, and incubated overnight at 37 °C. Liquid cultures were inoculated from single colonies and grown overnight at 37 °C while shaking at 220 revolutions per minute (rpm). DNA was purified using the QIAamp DNA Mini kit (Qiagen). All constructs were sequence‐verified using Sanger sequencing (Microsynth Seqlab and Genewiz). The plasmid pTKEI‐Dest, which served as a backbone for the insertion libraries, was a gift from David Savage (Addgene plasmid #79784).[ 12 ]

Reporter Assays

All reporter circuits used the monomeric red fluorescent protein 1 (RFP) as readout.[ 41 ] The design of the genetic circuits is depicted in Figure S2A (Supporting Information). In short, the AraC reporter was created by placing the RFP coding sequence under the control of a pBAD promoter. In case of the Flp recombinase, RFP was expressed from a constitutive promoter (J23102, http://parts.igem.org/Promoters/Catalog/Anderson). However, the coding sequence was inverted and flanked by Flp recognition target (FRT) sites. In the ground state, a dysfunctional mRNA is transcribed and only upon inversion of the RFP open reading frame by the recombinase, RFP is expressed. To measure TVMV protease activity, a ssrA‐like degradation tag[ 42 ] was fused to a constitutively expressed RFP; a TVMV recognition site was placed in between RFP and the degradation tag. Active TVMV protease would thus cleave off the degron resulting in RFP stabilization and an increase in fluorescence. Many related potyvirus proteases undergo a process called autolysis, [ 43 ] during which the protease cleaves off its own C‐terminal region albeit at low efficiency. This results in a truncated protease with decreased activity. To ensure that only one TVMV protein species would be present during all assays, a previously reported, truncated TVMV version[ 44 ] was used for insertion library generation. Finally, a reporter for SigF was constructed, based on a SigF‐specific promoter design previously reported by Bervoets et al.[ 45 ]

Domain Insertion Library Generation

To generate insertion libraries covering all possible effector protein positions, saturated programmable insertion engineering (SPINE) was used.[ 15 ] In short, the protein of interest was subdivided into chunks of ≈50 amino acids. For each chunk, an oligonucleotide sub‐pool (Agilent) was designed, comprising 50 individual DNA sequences, each of which carried a Type IIS restriction enzyme recognition site handle behind a specific amino acid‐encoding triplet. A Python pipeline for the automatic design of the required DNA sequences provided by Coyote‐Maestas et al.[ 15 ] was employed for oligo pool design. The sub‐pools were then individually cloned into an expression vector carrying the full‐length coding sequence of the respective effector protein of interest and transformed into chemically competent Oneshot Top10 E. coli. To ensure at least 40‐fold coverage of the library, serial dilutions were plated on agar plates following transformation and the number of colony‐forming units was calculated. The plasmid sub‐libraries were purified from the bacteria using the QIAamp DNA Mini Preparation Kit (Qiagen). The DNA concentration was measured using the Quant‐iT dsDNA (HS) assay kit (Thermo Fisher Scientific) and all sub‐libraries for each individual effector protein were pooled using equal DNA concentrations. To ensure that no wildtype protein contamination was carried over during cloning, the insertion handle was replaced by a kanamycin expression cassette via Golden Gate assembly. E. coli cells were transformed and plated on three 20 cm LB‐agar plates, supplemented with 50 µg mL−1 chloramphenicol and 25 µg mL−1 of kanamycin (Carl‐Roth). Again, a library coverage of at least 20× was ensured by serial dilutions and colony counting. The next day, each plate was rinsed with 3 mL of LB and the colonies were gently scraped off with a spatula. The resulting liquid cultures were collected from the plates and pooled for each protein. 
Plasmid DNA was then purified from the cultures and the kanamycin handle was replaced by the insert domain of choice, again using Golden Gate cloning. Finally, Oneshot Top10 E. coli carrying the respective reporter plasmid were transformed with the assembled libraries by electroporation. Following a recovery in super optimal broth supplemented with 20 mM glucose (Carl Roth) (SOC) for 1 h at 37 °C and 220 rpm, transformed cells were grown in LB (50 µg mL−1 chloramphenicol and 25 µg mL−1 of kanamycin) overnight. In parallel, serial dilutions were plated on agar. Plates were incubated overnight, and library coverage was estimated from colony counts (coverage was >50‐fold for all samples). Finally, glycerol stocks of the libraries were prepared by mixing the cultures with sterile 50% (v/v) glycerol at a ratio of 1:1, and stocks were stored at −80 °C until usage.
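The coverage estimates from serial dilutions and colony counting described above reduce to simple arithmetic. The following is a minimal sketch of that calculation; the colony count, dilution factor, and plated volume are hypothetical values chosen for illustration and are not taken from the study. Only the sub‐pool size of 50 variants follows the description above.

```python
def estimate_cfu(colonies, dilution_factor, plated_volume_ml, culture_volume_ml):
    """Estimate the total number of colony-forming units (CFU) in a
    transformation from the colony count on one serial-dilution plate."""
    cfu_per_ml = colonies * dilution_factor / plated_volume_ml
    return cfu_per_ml * culture_volume_ml


def fold_coverage(total_cfu, n_variants):
    """Average number of independent transformants per library member."""
    return total_cfu / n_variants


# Hypothetical example: 42 colonies counted on the 1:100 dilution plate,
# 0.1 mL plated from a 1 mL recovery culture.
total = estimate_cfu(colonies=42, dilution_factor=100,
                     plated_volume_ml=0.1, culture_volume_ml=1.0)

# One SPINE sub-pool covers a chunk of about 50 insertion positions.
coverage = fold_coverage(total, n_variants=50)
print(f"{total:.0f} CFU, {coverage:.0f}-fold coverage")
```

Under these assumptions, a sub‐pool of 50 variants requires at least 2000 transformants to reach 40‐fold coverage.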

FACS‐Based Library Enrichment

Precultures of LB media (50 µg mL−1 of chloramphenicol and 25 µg mL−1 of kanamycin) were inoculated from glycerol stocks of E. coli strains carrying the insertion libraries. Positive control samples expressing the wildtype effector protein without insert, as well as negative controls expressing a different protein of similar size (not activating the reporter) from the same plasmid backbone, were included. The precultures were incubated for 16 h at 37 °C while shaking at 220 rpm. The next day, 1 mL LB cultures were inoculated with 10 µL from the precultures. These main cultures were supplemented with 16 mM L‐arabinose and 400 µM IPTG for AraC, 400 µM IPTG for the TVMV protease, 200 µM IPTG for Flp, 100 µM IPTG for SigF for the first enrichment round, and 200 µM IPTG for SigF during the second round of enrichment. These cultures were incubated for 16 h at 37 °C while shaking at 220 rpm. For the AraC‐LOV2 libraries, two identical replicates were generated, one of which was incubated under blue light illumination and the other one in the dark. The next morning, the samples were diluted 1:100 in 1×PBS (Thermo Fisher Scientific) and kept on ice until sorting. FACS was performed on a FACSAria Fusion flow cytometer (BD Biosciences) at the ZMBH FACS facility (Heidelberg University). E. coli cells were identified and gated using the forward scatter (FSC) and side scatter (SSC) values (Figure S21, Supporting Information). The red fluorescent peak was sorted from each library. If no clear peak was visible, the 5% of cells with the highest RFP levels were sorted. 25 000 cells were sorted for each library into LB media. Next, the collected cells were recovered for one hour in LB media without antibiotics at 37 °C and shaking at 220 rpm. Subsequently, 50 µg mL−1 chloramphenicol and 25 µg mL−1 of kanamycin were added, followed by incubation of cultures overnight. The next day, glycerol stocks were prepared from the cultures representing sorted libraries. 
A second round of FACS‐sorting and enrichment was performed by repeating the procedure starting from the glycerol stocks after the first round of enrichment. FACS data were analyzed using the cytoflow python package (https://cytoflow.github.io/).

Next Generation Sequencing

The input libraries, as well as the enriched sorted fractions, were subjected to heat lysis. Cells were pelleted and resuspended in water. Aliquots were heated to 95 °C for 10 min, followed by centrifugation at 10 000 g for 10 min to remove cell debris. The supernatant was transferred to new tubes and stored at −20 °C until further use. The coding sequence of the libraries was amplified using the Q5 Hot Start High‐Fidelity DNA Polymerase (NEB) and the PCR amplicons were separated from primer dimers on a 0.5x TAE 1% agarose gel. The bands representing the protein hybrid libraries were excised and DNA was purified using the QIAquick Gel Extraction Kit (Qiagen). The DNA concentration was then measured with the Quant‐iT dsDNA (HS) assay kit (Thermo Fisher Scientific) using a plate reader (Tecan Infinite 200 Pro). Next, the DNA was fragmented and the sequencing libraries were prepared using the Illumina Nextera XT kit (Illumina). The manufacturer's protocol was followed, with two modifications. First, to prevent under‐tagmentation, only 0.2 ng of DNA was used as input and the tagmentation step was performed for 15 min, instead of 5 min. Second, during library preparation, the samples to be pooled were barcoded using the Nextera XT Index Kit v2 (Illumina). The final sequencing libraries were then purified using AMPure XP magnetic beads (Beckman Coulter) according to the manufacturer's protocol. A two‐sided size selection was performed using 25 µL beads together with 50 µL input reaction during the first size selection step and 100 µL of beads during the second step. Following library clean‐up, the DNA concentration was measured again using the Quant‐iT dsDNA (HS) assay kit (Thermo Fisher Scientific) and the different libraries were pooled at equal concentrations. Next, library quality was assessed on a Bioanalyzer (Agilent) using the Agilent DNA 1000 Kit. 
Finally, samples were sequenced using the paired‐end Illumina MiSeq and NextSeq sequencing services at the EMBL Gene Core facility (Heidelberg).

Experimental Characterization of Individual Variants from the Domain Insertion Screen

Individual protein hybrids were isolated from the sorted fractions or cloned individually and stored as glycerol stocks in 25% glycerol (Carl Roth). The variants tested are listed in Table S3 (Supporting Information). Precultures of Oneshot Top10 cells carrying a RFP reporter plasmid specific to the respective protein hybrid, as well as a plasmid encoding the respective switchable variant, were inoculated from glycerol stocks into lysogeny broth (LB) (Carl Roth), supplemented with 50 µg mL−1 chloramphenicol (Carl Roth) and 25 µg mL−1 of kanamycin (Carl Roth). Cultures were prepared in technical triplicates in 96‐well plates (Corning), using a volume of 200 µL per well. The precultures were incubated for 16 h at 37 °C while shaking at 220 rpm. Main cultures were similarly prepared in 96‐well plates, using LB supplemented with 50 µg mL−1 chloramphenicol and 25 µg mL−1 of kanamycin, using the same induction scheme as for the FACS screen. The cultures were inoculated with 3 µL from the respective precultures and grown at 37 °C and 220 rpm for 16 h. Following incubation, RFP fluorescence and OD600 were measured on a plate reader (Tecan Infinite 200 Pro). For RFP measurements, an excitation wavelength of 490 nm and an emission wavelength of 520 nm were used. The reported RFP/OD600 values were calculated by dividing the measured fluorescence by the OD600 levels. Three independent biological replicates prepared and measured on different days were generated for each variant.

Illumination Setup

For the illumination of liquid cultures, a custom‐made LED setup was used. Eight blue light high‐power LEDs (type CREE XP‐E D5‐15; emission peak ≈460 nm; emission angle ≈130°; LED‐TECH.DE) were mounted onto an aluminum plate and connected to a Switching Mode Power Supply (Manson; HCS‐3102). The LED‐plate was installed upside down within a shaking incubator, so that the LEDs could illuminate the surface area of the shaking platform from a distance of ≈30 cm. Liquid cultures were incubated in multi‐well plates and illuminated at a constant intensity of 50 µmol m−2 s−1 (≙ 5 W m−2).

For the illumination of agar plates (see “agar plate photography” below), a custom‐made array of 96 LEDs (LB T64G‐AACB‐59‐Z484‐20‐R33‐Z, Osram, emission peak 469 nm, viewing angle 30 °, Mouser Electronics) mounted onto a circuit board was used, applying a light intensity of 15 µmol m−2 s−1 (≙ 1.5 W m−2). This device was again powered by a Switching Mode Power Supply (Manson; HCS‐3102). A photo‐mask made from black vinyl (Starlab) was cut out by hand and was directly attached to the bottom of the agar plate. The plate was then placed above the LED array at a distance of ≈5 cm. The whole setup was installed inside a standard bacteria incubator (Minitron, Infors). The LED devices were custom‐made by the workshop of the biology department at TU Darmstadt.

Characterization of AraC‐LOV2 Hybrids

Precultures of Oneshot Top10 cells (Thermo Fisher Scientific) carrying the RFP reporter plasmid for AraC and an IPTG inducible expression plasmid encoding the transcription factor or its derivatives, were inoculated from glycerol stocks into LB (Carl Roth), supplemented with 50 µg mL−1 chloramphenicol (Carl Roth) and 25 µg mL−1 of kanamycin (Carl Roth). Cultures were prepared in 48‐well plates (Corning), using a volume of 0.5 mL per well. The precultures were incubated for 16 h at 37 °C, while shaking at 220 rpm. Main cultures were similarly prepared in 48‐well plates, using LB supplemented with 50 µg mL−1 chloramphenicol and 25 µg mL−1 of kanamycin, together with different amounts of IPTG (Carl Roth) and L‐arabinose (Carl Roth). IPTG concentrations used in each sample are indicated in the corresponding figures/legends. The cultures were prepared in duplicates and inoculated with 5 µL from the respective precultures. Subsequently, one replicate was incubated under blue light exposure, while the other replicate was kept in the dark within the same incubator. The growth conditions were again at 37 °C and 220 rpm for 16 h. Following incubation, RFP fluorescence and OD600 were measured in a plate reader. As before, an excitation wavelength of 490 nm and an emission wavelength of 520 nm were used and the fluorescence was normalized to the OD600. Experiments were performed in three independent replicates.

Activity measurements of the AraC derivatives carrying point mutations were performed identically using an arabinose concentration of 8 mM.

Agar Plate Photography

Prior to the experiment, agar plates were prepared using 1.5% LB‐agar, supplemented with 50 µg mL−1 chloramphenicol and 25 µg mL−1 of kanamycin, 400 µM IPTG and 25 mM L‐arabinose (all Carl Roth). A preculture of the AraC‐S170‐LOV reporter strain was incubated overnight at 37 °C and 220 rpm. The next day, 0.6% LB‐agar was freshly prepared and cooled to ≈40 °C. Next, 3 mL of the liquid agar were supplemented with IPTG and L‐arabinose to final concentrations of 400 µM and 25 mM, respectively. Finally, 300 µL of the preculture was quickly added to the agar, mixed by shaking and distributed on the previously prepared agar plates. After 30 min at room temperature, the top agar had solidified, and the photo‐mask was glued to the bottom of the plate. Finally, the plate was incubated at 37 °C overnight, under constant blue light illumination. Images were acquired on the next day using a UV light source, high‐pass filter, and camera.

Reversible Optogenetic Gene Expression Control

In a 48‐well plate (Corning), 0.5 mL cultures were prepared, using LB media, supplemented with 50 µg mL−1 chloramphenicol and 25 µg mL−1 of kanamycin, 400 µM IPTG and 25 mM L‐arabinose (all Carl Roth). The wells were inoculated with 5 µL of precultures that had been prepared as described above. The samples were then incubated at 37 °C and 220 rpm for 3 h in darkness, followed by 3 h incubation under blue light exposure and a final step of 3 h in the dark. Prior to the first incubation step and after each following incubation period, the RFP fluorescence and the OD600 were measured in a plate reader. Following every incubation period, the samples were diluted 1:30 into new plates with pre‐warmed fresh media, containing all supplements. The final relative fluorescence was obtained by normalizing the RFP values to the measured OD600. Three independent replicates were generated by repeating experiments on different days.

Structure Prediction with AlphaFold2

Full‐length structures of AraC, SigF, the TVMV protease, Flp, as well as the AraC‐LOV2 fusions were obtained by AlphaFold2[ 21 ] using the Colabfold implementation.[ 22 ] Structures were predicted using the “colabfold_batch” command with the “MMseqs2 (UniRef+Environmental)” MSA preferences. For the proteins without insertion, five models were run with three recycling iterations. To reduce compute time, only one model was predicted for the AraC‐LOV2 hybrids, using a single recycling step. Images of the models were generated using UCSF ChimeraX (version 1.4).[ 46 , 47 ] To compute the position‐wise RMSDs between the AraC‐LOV2 hybrids and the respective wildtype structures, the AF2 structures of AraC and the LOV2 domain were separately superimposed onto the prediction of the fusion proteins and RMSDs were calculated amino acid‐wise. Computations were performed on the KIT Horeka cluster.

Besides AF2‐predicted structures, several previously published experimental structures are shown in several figures (Table S5, Supporting Information).

NGS and Data Analysis

To analyze the sequencing data, fastq files were de‐multiplexed using the Sabre tool (https://github.com/najoshi/sabre). The domain insertion frequencies were then calculated using a slightly modified version of the DIP‐seq library.[ 12 ] Briefly, the sequencing data were subjected to quality control, i.e., corrupted or mutant reads were filtered out. Next, reads that contained the insert sequence were selected and the insertion site was determined. Then, the enrichment scores were calculated using the following Equation 1:

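$$\mathrm{Enrichment\ score}_i = \log_2 \left( \frac{\mathrm{count}_{\mathrm{enriched},i} \,/\, \sum_{j=1}^{n} \mathrm{count}_{\mathrm{enriched},j}}{\mathrm{count}_{\mathrm{initial},i} \,/\, \sum_{j=1}^{n} \mathrm{count}_{\mathrm{initial},j}} \right) \qquad (1)$$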

where i are the insertion positions within a given protein, countenriched represents the read counts after enrichment, and countinitial indicates the read counts of the initial library that was used as input to the sorting experiments. Insertions that were missing from the initial libraries were not taken into account during the analysis. Insertion variants that entirely disappeared during sorting and could thus not be log2‐scaled, were assigned a value of −10, which was in the range of the lowest experimentally obtained enrichment scores.
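The enrichment calculation can be illustrated with a short example. This is a minimal sketch of Equation 1 and not the DIP‐seq code used in the study; the function name, the dictionary‐based input format, and the read counts are hypothetical. It further assumes that the read totals are summed over positions present in the initial library.

```python
import math

DROPOUT_SCORE = -10.0  # value assigned to variants that vanished during sorting


def enrichment_scores(initial_counts, enriched_counts):
    """Return {position: log2 enrichment score} following Equation 1.

    Both arguments map an insertion position to its read count.
    Positions missing from the initial library are skipped.
    """
    # Assumption: totals are taken over positions found in the initial library.
    positions = [p for p, c in initial_counts.items() if c > 0]
    total_initial = sum(initial_counts[p] for p in positions)
    total_enriched = sum(enriched_counts.get(p, 0) for p in positions)

    scores = {}
    for p in positions:
        enriched = enriched_counts.get(p, 0)
        if enriched == 0:
            # log2(0) is undefined, so dropouts receive a fixed low score.
            scores[p] = DROPOUT_SCORE
            continue
        freq_enriched = enriched / total_enriched
        freq_initial = initial_counts[p] / total_initial
        scores[p] = math.log2(freq_enriched / freq_initial)
    return scores


# Hypothetical read counts for four insertion positions.
initial = {10: 100, 11: 100, 12: 200, 13: 0}
enriched = {10: 300, 11: 0, 12: 100, 13: 50}
scores = enrichment_scores(initial, enriched)
```

In this toy example, position 10 is enriched threefold, position 12 is depleted twofold, position 11 receives the dropout value of −10, and position 13 is excluded because it was absent from the input library.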

To gather position‐wise protein features, diverse feature sources were used. Biophysical properties and linker propensity indices were fetched from the AAindex database.[ 23 , 24 ] Information about secondary structure, accessible surface area and pLDDT score was extracted from the AF2‐predicted structures. To map these features to the enrichment scores, the mean of the respective feature over the two amino acids that neighbor the insertion site was assigned to the enrichment score. For the machine learning applications described below, the categorical features, such as secondary structures, were binarized similar to one‐hot encodings, with the difference that every position could have two possible positive labels (if the secondary structure assignments of the two neighboring residues differ). The Kullback–Leibler divergence (KLD), as well as the insertion and deletion statistics, were based on sequence alignments. To this end, similar sequences were gathered using position‐specific iterated basic local alignment search (PSI‐BLAST),[ 48 , 49 ] with an expect threshold of 0.01 and a PSI‐BLAST threshold of 0.005. The maximum number of sequences was limited to 5000. Based on these sequences, an MSA was calculated with MUSCLE,[ 50 ] using the Super5 algorithm with standard parameters. Finally, the KLD was calculated as indicated in Equation 2:

$$\mathrm{Divergence}_i = \sum_{a} f_i(a) \cdot \log_{10} \frac{f_i(a)}{b(a)} \qquad (2)$$

where the divergence is determined for the position i and f(a) is the frequency of the amino acid a at the given position, while b(a) represents the background frequency of the amino acid. Background frequencies were defined as the AA frequencies in SwissProt.[ 51 ] Of note, the definition of the gap background frequencies is non‐trivial, as discussed by Teşileanu et al.[ 52 ] Here, gaps were not included and the KLD is only based on AA frequencies. The position‐wise insertion and deletion frequencies as well as the scores for the mean and median insertion lengths were calculated from pairwise alignments between the sequence of the protein of interest and its related sequences gathered by PSI‐BLAST.
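As an illustration of Equation 2, the following minimal sketch computes the divergence for a single alignment column. The function name and the example columns are hypothetical, and a uniform background distribution is assumed here purely for simplicity, whereas the study used SwissProt amino acid frequencies. Assigning a divergence of zero to all‐gap columns is likewise an assumption.

```python
import math
from collections import Counter


def column_kld(column, background):
    """Divergence (log base 10) of one MSA column, following Equation 2.

    column: string of residues observed at one alignment position
    background: dict mapping amino acid -> background frequency
    Gaps and other non-amino-acid characters are ignored.
    """
    residues = [aa for aa in column if aa in background]
    if not residues:
        return 0.0  # assumption: all-gap columns get zero divergence
    counts = Counter(residues)
    n = len(residues)
    kld = 0.0
    for aa, count in counts.items():
        f = count / n  # observed frequency of this amino acid in the column
        kld += f * math.log10(f / background[aa])
    return kld


# Hypothetical uniform background (the study used SwissProt frequencies).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
uniform = {aa: 1 / 20 for aa in AMINO_ACIDS}

conserved = column_kld("LLLLLLLL--", uniform)  # fully conserved column
variable = column_kld("ACDEFGHIKL", uniform)   # ten different residues
```

A fully conserved column yields a higher divergence than a variable one, which is why the KLD serves as a conservation measure.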

Gradient Boosting Models

In order to train predictive models on the insertion data, the enrichment scores were first binarized. All sites exhibiting a positive enrichment were assigned the label 1 and all sites with a negative enrichment were labeled 0. All position‐wise properties collected during data analysis were used as features. In addition, each amino acid and each secondary structure element represented individual additional features. Dataset construction and model training were performed using the Scikit‐learn framework.[ 53 ] Individual datasets for every candidate protein, as well as a complete dataset using the combined data of all four proteins, were constructed. An 80:20 train‐test split was applied and the features were min‐max scaled prior to training. As model architecture, gradient‐boosted regression trees were used.[ 29 ] Gradient boosting models are ensemble models that iteratively use simple models to optimize a loss function. Here, the gradient boosting classifier implementation from Scikit‐learn was used, which employs regression trees as base models. The model was optimized on the training data set using five‐fold cross‐validation. Hyperparameters were optimized on the complete dataset using grid search. For the final model, 100 estimators were trained using squared error and a learning rate of 0.1. The maximum depth of the trees was limited to four and the exponential loss was chosen. The maximum number of features parameter was kept at “auto”. The receiver operating characteristic (ROC) and precision‐recall plots were chosen as performance metrics. ROC curves illustrate the classification performance by setting the true positive classification rate in relation to the false positive classification rate for different classification thresholds. The area under the ROC curve thus summarizes the relation between true positives and false positives in a single value. Precision‐recall plots, instead, show the precision that a model reaches in relation to its recall or sensitivity. 
The average precision refers to the weighted mean of the calculated precisions. The permutation importance and loss of impurity were calculated using the respective Scikit‐learn functions.

Statistical Analysis

The domain insertion screen was performed in two independent replicates. Pearson correlations were calculated, to assess the similarity between replicates. For the analysis of domain insertion tolerance, the mean of the two replicates was used. In order to analyze the influence of positional protein features on domain insertion permissibility, Spearman correlations between the measured enrichment scores and the respective features were calculated and the Spearman r values are reported. The experimental validation of individual variants and the characterization of the AraC‐LOV hybrids were performed in n = 3 independent replicates. The mean of the measurements, as well as the standard deviation are indicated in the respective figures.
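The two correlation measures can be illustrated with a small, dependency‐free sketch. The replicate scores and the feature values below are invented for illustration, and the helper functions are hypothetical rather than the routines used in the study. Spearman's r is computed here as the Pearson correlation of average ranks.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical enrichment scores of five insertion sites in two replicates.
rep1 = [1.2, -0.5, -10.0, 2.1, 0.3]
rep2 = [1.0, -0.8, -10.0, 1.9, 0.6]
replicate_r = pearson(rep1, rep2)

# The mean of both replicates is correlated with a position-wise feature.
mean_score = [(a + b) / 2 for a, b in zip(rep1, rep2)]
feature = [0.2, 0.6, 0.9, 0.1, 0.4]  # hypothetical conservation-like feature
feature_rho = spearman(mean_score, feature)
```

Because the Spearman correlation depends only on ranks, it is not dominated by the dropout value of −10, unlike the Pearson correlation.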

Conflict of Interest

The authors declare no conflict of interest.

Author Contributions

D.N. and J.M. conceived the study. J.M., S.A., and P.B. designed and performed the experiments. J.M. implemented the computational analysis. D.N. directed the work and secured funding. J.M. and D.N. wrote the manuscript with support from all authors.

Supporting information

Supporting Information

Acknowledgements

The authors thank the members of the Niopek lab for helpful discussions. Further, the authors are grateful to the ZMBH flow cytometry core facility (Heidelberg University) for support with cell sorting and the EMBL Genomics Core Facility (EMBL, Heidelberg) for performing deep sequencing. Finally, the authors sincerely thank the workshop at the Biology Department of the Technical University Darmstadt for the construction of customized illumination setups. Funded by the European Union (ERC, DaVinci‐Switches, project number 101041570). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. D.N. is also grateful for funding from the German Research Foundation (DFG) [project no. 453202693], the Schwiete Stiftung, and the Aventis foundation. J.M. was partially funded by the German Academic Scholarship Foundation.

Mathony J., Aschenbrenner S., Becker P., Niopek D., Dissecting the Determinants of Domain Insertion Tolerance and Allostery in Proteins. Adv. Sci. 2023, 10, 2303496. 10.1002/advs.202303496

Contributor Information

Jan Mathony, Email: jan.mathony@uni-heidelberg.de.

Dominik Niopek, Email: dominik.niopek@uni-heidelberg.de.

Data Availability Statement

The data that support the findings of this study are available in the supplementary material of this article. The computational analysis, as well as experimental raw data, are available on GitHub under: https://github.com/Niopek-Lab/DI_screen. The structures shown in the figures, including all color codes, are provided in the GitHub repository as ChimeraX session files. The AF2‐predicted structures of all possible AraC‐PDZ hybrids (corresponding to Figure S14) are available as PDB files on GitHub. Plasmids encoding AraC‐I113‐LOV, AraC‐S170‐LOV, AraC‐S170‐LOV_G141D, and AraC‐S170‐LOV_T50S are available on Addgene (Addgene‐IDs: #206804; #206805).

References

  • 1. Ponting C. P., Russell R. R., Annu. Rev. Biophys. Biomol. Struct. 2002, 31, 45.
  • 2. Jin J., Xie X., Chen C., Park J. G., Stark C., James D. A., Olhovsky M., Linding R., Mao Y., Pawson T., Sci Signal 2009, 2, ra76.
  • 3. Dagliyan O., Tarnawski M., Chu P. H., Shirvanyants D., Schlichting I., Dokholyan N. V., Hahn K. M., Science 2016, 354, 1441.
  • 4. Siegel M. S., Isacoff E. Y., Neuron 1997, 19, 735.
  • 5. Lee J., Natarajan M., Nashine V. C., Socolich M., Vo T., Russ W. P., Benkovic S. J., Ranganathan R., Science 2008, 322, 438.
  • 6. Ostermeier M., Protein Eng Des Sel 2005, 18, 359.
  • 7. Guntas G., Mansell T. J., Kim J. R., Ostermeier M., Proc. Natl. Acad. Sci. USA 2005, 102, 11224.
  • 8. Dagliyan O., Dokholyan N. V., Hahn K. M., Nat. Protoc. 2019, 14, 1863.
  • 9. Hoffmann M. D., Mathony J., Upmeier Zu Belzen J., Harteveld Z., Aschenbrenner S., Stengl C., Grimm D., Correia B. E., Eils R., Niopek D., Nucleic Acids Res. 2021, 49, e29.
  • 10. Bubeck F., Hoffmann M. D., Harteveld Z., Aschenbrenner S., Bietz A., Waldhauer M. C., Börner K., Fakhiri J., Schmelas C., Dietz L., Grimm D., Correia B. E., Eils R., Niopek D., Nat. Methods 2018, 15, 924.
  • 11. Oakes B. L., Nadler D. C., Flamholz A., Fellmann C., Staahl B. T., Doudna J. A., Savage D. F., Nat. Biotechnol. 2016, 34, 646.
  • 12. Nadler D. C., Morgan S. A., Flamholz A., Kortright K. E., Savage D. F., Nat. Commun. 2016, 7, 12266.
  • 13. Reynolds K. A., McLaughlin R. N., Ranganathan R., Cell 2011, 147, 1564.
  • 14. Edwards W. R., Busse K., Allemann R. K., Jones D. D., Nucleic Acids Res. 2008, 36, e78.
  • 15. Coyote‐Maestas W., Nedrud D., Okorafor S., He Y., Schmidt D., Nucleic Acids Res. 2019, 48, e11.
  • 16. Coyote‐Maestas W., He Y., Myers C. L., Schmidt D., Nat. Commun. 2019, 10, 290.
  • 17. Coyote‐Maestas W., Nedrud D., Suma A., He Y., Matreyek K. A., Fowler D. M., Carnevale V., Myers C. L., Schmidt D., Nat. Commun. 2021, 12, 7114.
  • 18. Fernandez‐Rodriguez J., Voigt C. A., Nucleic Acids Res. 2016, 44, 6493.
  • 19. Ormö M., Cubitt A. B., Kallio K., Gross L. A., Tsien R. Y., Remington S. J., Science 1996, 273, 1392.
  • 20. Dagliyan O., Shirvanyants D., Karginov A. V., Ding F., Fee L., Chandrasekaran S. N., Freisinger C. M., Smolen G. A., Huttenlocher A., Hahn K. M., Dokholyan N. V., Proc. Natl. Acad. Sci. USA 2013, 110, 6800.
  • 21. Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A., Bridgland A., Meyer C., Kohl S. A. A., Ballard A. J., Cowie A., Romera‐Paredes B., Nikolov S., Jain R., Adler J., Back T., Petersen S., Reiman D., Clancy E., Zielinski M., Steinegger M., Pacholska M., Berghammer T., Bodenstein S., Silver D., Vinyals O., et al., Nature 2021, 596, 583.
  • 22. Mirdita M., Schütze K., Moriwaki Y., Heo L., Ovchinnikov S., Steinegger M., Nat. Methods 2022, 19, 679.
  • 23. Kawashima S., Pokarowski P., Pokarowska M., Kolinski A., Katayama T., Kanehisa M., Nucleic Acids Res. 2008, 36, D202.
  • 24. Kawashima S., Kanehisa M., Nucleic Acids Res. 2000, 28, 374.
  • 25. Suyama M., Ohara O., Bioinform. 2003, 19, 673.
  • 26. George R. A., Heringa J., Protein Eng Des Sel 2002, 15, 871.
  • 27. Bae K., Mallick B. K., Elsik C. G., Bioinform. 2005, 21, 2264.
  • 28. Akdel M., Pires D. E. V., Pardo E. P., Jänes J., Zalevsky A. O., Mészáros B., Bryant P., Good L. L., Laskowski R. A., Pozzati G., Shenoy A., Zhu W., Kundrotas P., Serra V. R., Rodrigues C. H. M., Dunham A. S., Burke D., Borkakoti N., Velankar S., Frost A., Basquin J., Lindorff‐Larsen K., Bateman A., Kajava A. V., Valencia A., Ovchinnikov S., Durairaj J., Ascher D. B., Thornton J. M., Davey N. E., et al., Nat. Struct. Mol. Biol. 2022, 29, 1056.
  • 29. Friedman J. H., Comput Stat Dat Anal. 2002, 38, 367.
  • 30. Louppe G., 10.48550/arXiv.1407.7502, 2015.
  • 31. Mathony J., Niopek D., Adv. Biol. 2021, 5, 2000181.
  • 32. Soisson S. M., MacDougall‐Shackleton B., Schleif R., Wolberger C., Science 1997, 276, 421.
  • 33. Schleif R., FEMS Microbiol. Rev. 2010, 34, 779.
  • 34. Halabi N., Rivoire O., Leibler S., Ranganathan R., Cell 2009, 138, 774.
  • 35. Romano E., Baumschlager A., Akmeriç E. B., Palanisamy N., Houmani M., Schmidt G., Öztürk M. A., Ernst L., Khammash M., Di Ventura B., Nat. Chem. Biol. 2021, 17, 817.
  • 36. Dietler J., Schubert R., Krafft T. G. A., Meiler S., Kainrath S., Richter F., Schweimer K., Weyand M., Janovjak H., Möglich A., J. Mol. Biol. 2021, 433, 167107.
  • 37. Li X., Zhang C., Xu X., Miao J., Yao J., Liu R., Zhao Y., Chen X., Yang Y., Nucleic Acids Res. 2020, 48, e33.
  • 38. Jayaraman P., Devarajan K., Chua T. K., Zhang H., Gunawan E., Poh C. L., Nucleic Acids Res. 2016, 44, 6994.
  • 39. Jayaraman P., Yeoh J. W., Zhang J., Poh C. L., ACS Synth. Biol. 2018, 7, 2627.
  • 40. Engler C., Kandzia R., Marillonnet S., PLoS One 2008, 3, e3647.
  • 41. Campbell R. E., Tour O., Palmer A. E., Steinbach P. A., Baird G. S., Zacharias D. A., Tsien R. Y., Proc. Natl. Acad. Sci. USA 2002, 99, 7877.
  • 42. McGinness K. E., Baker T. A., Sauer R. T., Mol. Cell 2006, 22, 701.
  • 43. Kapust R. B., Tözsér J., Fox J. D., Anderson D. E., Cherry S., Copeland T. D., Waugh D. S., Protein Eng. Des. Sel. 2001, 14, 993.
  • 44. Sun P., Austin B. P., Tözsér J., Waugh D. S., Protein Sci. 2010, 19, 2240.
  • 45. Bervoets I., Van Brempt M., Van Nerom K., Van Hove B., Maertens J., De Mey M., Charlier D., Nucleic Acids Res. 2018, 46, 2133.
  • 46. Goddard T. D., Huang C. C., Meng E. C., Pettersen E. F., Couch G. S., Morris J. H., Ferrin T. E., Protein Sci. 2018, 27, 14.
  • 47. Pettersen E. F., Goddard T. D., Huang C. C., Meng E. C., Couch G. S., Croll T. I., Morris J. H., Ferrin T. E., Protein Sci. 2021, 30, 70.
  • 48. Altschul S. F., Madden T. L., Schäffer A. A., Zhang J., Zhang Z., Miller W., Lipman D. J., Nucleic Acids Res. 1997, 25, 3389.
  • 49. Altschul S. F., Gish W., Miller W., Myers E. W., Lipman D. J., J. Mol. Biol. 1990, 215, 403.
  • 50. Edgar R. C., Nucleic Acids Res. 2004, 32, 1792.
  • 51. Bairoch A., Apweiler R., J Mol Med (Berl) 1997, 75, 312.
  • 52. Teşileanu T., Colwell L. J., Leibler S., PLoS Comput. Biol. 2015, 11, e1004091.
  • 53. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., Duchesnay É., J Mach Learn Res 2011, 12, 2825.


Articles from Advanced Science are provided here courtesy of Wiley
