Abstract
The CONSTANS-like (COL) transcription factors integrate photoperiod cues with developmental regulation in plants, yet the evolutionary forces shaping their structural diversity remain poorly understood. Here, the evolutionary history of COL5 was reconstructed across 31 Brassicaceae genomes using a curated set of 284 high-confidence orthologs validated for domain architecture, alignment quality, and absence of substitution saturation. Branch-specific codon models identified a single episodically selected lineage within Arabidopsis thaliana, and site-level analyses mapped two non-synonymous amino-acid replacements uniquely acquired along this branch. Ancestral sequence reconstruction recovered the historical residues at both positions with posterior probability 1.0, enabling controlled reverse-evolution mutagenesis. Reintroduction of these ancestral states into the modern COL5 protein revealed a profound biophysical impact: Rosetta ΔΔG values indicated strong destabilization, and 100-ns molecular dynamics simulations showed greater structural deviation, increased compaction, reduced flexibility, and significantly elevated potential energy. These results demonstrate that the derived residues stabilize the contemporary COL5 fold, whereas the ancestral residues are incompatible with the evolved structural background. The findings provide direct mechanistic evidence that episodic positive selection on COL5 produced a lasting shift in protein stability and conformational dynamics, illustrating how adaptive molecular evolution can reshape protein energy landscapes and entrench derived states through historical contingency.
Keywords: Arabidopsis thaliana, CONSTANS-like (COL), Episodic positive selection, Ancestral sequence reconstruction, aBSREL, MEME
Subject terms: Molecular evolution, Computational biology and bioinformatics
Introduction
The CONSTANS-like (COL) transcription factors form a deeply conserved gene family that regulates photoperiodic signaling, circadian integration and developmental timing across angiosperms1. These functions rely on the canonical COL protein architecture, which couples one or two N-terminal B-box zinc-finger domains mediating protein–protein interactions with a C-terminal CCT motif responsible for nuclear localization and transcriptional control2. Although these domains are highly conserved, COL genes nonetheless show lineage-specific diversification across plants, suggesting that adaptive pressures have periodically modified their regulatory functions in response to ecological and environmental change3. Among Brassicaceae, where extensive genome resources are available, the COL family is particularly amenable to comparative evolutionary analyses aimed at understanding how photoperiod-responsive regulators evolve and adapt. Previous comparative studies have indicated that individual COL genes may experience heterogeneous selective pressures, with most members evolving under strong purifying constraint while a minority exhibit signatures of episodic positive selection3. A pilot analysis of the Arabidopsis COL family recently suggested that COL5, in particular, may have undergone an isolated burst of adaptive evolution, with scattered residues showing potential functional relevance in light-dependent signaling pathways4. However, that initial investigation, while suggestive, did not resolve the precise location, direction, or mechanistic impact of these putatively adaptive substitutions. As a result, the extent to which positive selection remodeled COL5’s structural or dynamic properties, and whether such remodeling contributed to functional optimization, has remained unknown.
A mechanistic link between historical adaptive substitutions and contemporary protein behavior is challenging to establish, because sequence signatures of selection do not indicate how individual amino-acid replacements influence folding, stability or conformational dynamics. Such relationships can only be resolved by integrating phylogenetically robust site identification with ancestral sequence reconstruction and experimental or computational resurrection of historical states5. Reverse-evolution mutagenesis, in which ancestral residues are reintroduced into modern proteins, provides a direct means to test whether derived substitutions confer specific biophysical advantages, and whether ancestral states are now incompatible with the evolved structural background6–8. When combined with atomistic molecular dynamics simulations and energy-based modeling, this approach enables causal inference linking episodic selection to measurable changes in protein energetics and conformational landscapes7,9.
In this study, the evolutionary history of COL5 was reconstructed across 31 Brassicaceae genomes using a high-confidence ortholog dataset validated for domain architecture, alignment quality and the absence of substitution saturation. Episodic selection was localized to a single Arabidopsis COL5 lineage, and site-level inference identified two derived amino-acid replacements that arose specifically along this branch. By resurrecting their ancestral states in the modern protein and assessing the resulting structural and energetic consequences through Rosetta stability modeling and molecular dynamics simulations, the biophysical legacy of this adaptive episode was directly tested. This combined evolutionary-biophysical framework allows the mechanistic role of positively selected residues to be elucidated and provides a model for understanding how episodic molecular evolution can restructure protein stability, dynamics and functional potential.
Materials and methods
Identification of Arabidopsis COL genes and curation of orthologs
COL candidate genes were first retrieved from the Arabidopsis thaliana genome (Araport11) in Phytozome v14 (ref. 10) using the keyword “CONSTANS-like”. This search returned an initial set of twelve annotated proteins. All twelve candidates were screened using InterProScan v5.66–89.0 (ref. 11) with the Pfam, SMART and ProSiteProfiles databases enabled to determine the presence of the two diagnostic domains of the COL family: the N-terminal B-box zinc-finger domain (PF00643) and the C-terminal CCT motif (PF06203). Only proteins possessing both domains were classified as bona fide COL genes. Ten of the twelve retrieved proteins met these criteria and were retained. Domain boundaries and overall B-box–CCT architecture were validated using custom Python scripts to ensure positional accuracy and eliminate partial or truncated annotations. These ten rigorously validated A. thaliana COL proteins served as the reference query set for ortholog identification.
To curate COL orthologs across Brassicaceae, the ten Arabidopsis reference proteins were used in a reciprocal best-hit (RBH) workflow optimized following the empirically validated parameters of Moreno-Hagelsieb and Latimer12,13. Similarity searches were conducted using BLAST+ v2.11.0 (blastp) against 31 Brassicaceae proteomes available in Phytozome v14, with Carica papaya included as an outgroup. Searches used soft masking, Smith–Waterman traceback, an E-value threshold of 1e-5, and a maximum of 500 targets per query. Candidate orthologs were retained only when both query and subject coverage exceeded 90%, and strict one-to-one RBH pairs were accepted as true orthologs12. All resulting protein sequences were re-examined with InterProScan to confirm the presence of both the B-box and CCT domains; sequences lacking either domain were excluded. Corresponding coding sequences (CDS) were retrieved from Phytozome and verified for exact protein-CDS correspondence using Biopython14. This workflow produced a final curated dataset of 283 high-confidence unique COL orthologs spanning 31 Brassicaceae species, plus 1 ortholog from the outgroup, for 284 sequences in total.
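The reciprocal best-hit logic described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: it assumes the BLAST hits have already been parsed into tuples of (query, subject, bit score, query coverage, subject coverage), and the names `best_hits` and `reciprocal_best_hits`, the tuple layout and the example identifiers are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# One parsed BLAST hit: (query_id, subject_id, bitscore, query_cov, subject_cov).
# Coverages are fractions in [0, 1]; this tuple layout is an assumption of the sketch.
Hit = Tuple[str, str, float, float, float]


def best_hits(hits: Iterable[Hit], min_cov: float = 0.90) -> Dict[str, str]:
    """Return the top-scoring subject per query among hits passing the coverage filter."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, bitscore, qcov, scov in hits:
        # Both query and subject coverage must exceed the threshold.
        if qcov <= min_cov or scov <= min_cov:
            continue
        # Keep the first hit seen when bit scores tie.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward: Iterable[Hit], reverse: Iterable[Hit],
                         min_cov: float = 0.90) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    fwd = best_hits(forward, min_cov)
    rev = best_hits(reverse, min_cov)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Hypothetical identifiers and scores, for illustration only.
forward_hits = [
    ("AtCOL5", "BrA", 500.0, 0.95, 0.96),
    ("AtCOL5", "BrB", 520.0, 0.80, 0.95),  # higher score but fails query coverage
    ("AtCOL4", "BrB", 300.0, 0.97, 0.93),
]
reverse_hits = [
    ("BrA", "AtCOL5", 498.0, 0.96, 0.95),
    ("BrB", "AtCOL3", 310.0, 0.95, 0.95),  # BrB prefers a different query
]
pairs = reciprocal_best_hits(forward_hits, reverse_hits)
```

In this toy input only the AtCOL5/BrA pair survives, because the coverage filter is applied before the best hit is chosen and the BrB hit is not reciprocated.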
Multiple sequence alignment, quality filtering and saturation testing
Protein sequences were aligned using PRANK15, a phylogeny-aware method designed to minimize alignment artifacts. Codon alignments were generated through back-translation using PAL2NAL16 to preserve reading-frame structure. Alignments were inspected in AliView v1.30 (ref. 17), and ambiguous regions were removed using ClipKIT18 under the “kpic-smart-gap” mode. CDS were subjected to a strict quality-control pipeline implemented in Python. Sequences lacking an ATG start codon or an in-frame terminal stop codon, or containing premature internal stops, were removed4. Only sequences with lengths divisible by three and longer than 300 nucleotides were retained, and outliers in CDS length were removed using interquartile range filtering. Synonymous substitution saturation was assessed using Xia’s Iss test implemented in DAMBE19,20. Maximum-likelihood phylogenies were constructed using IQ-TREE 3 (ref. 21) with the best-fit substitution model selected by ModelFinder22, and node support was estimated using 1,000 ultrafast bootstrap replicates. Only non-saturated alignments were used for downstream evolutionary analyses.
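The CDS quality-control rules listed above can be expressed compactly in Python. The sketch below is an illustration under stated assumptions rather than the script used in this study: the function names are hypothetical, the standard stop codons TAA, TAG and TGA are assumed, and the 1.5 × IQR fence is a conventional choice that the text does not specify.

```python
import statistics
from typing import Dict

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear genetic code assumed


def passes_cds_checks(cds: str, min_len: int = 300) -> bool:
    """Apply the start, stop, frame and minimum-length rules to one CDS."""
    seq = cds.upper()
    if len(seq) % 3 != 0 or len(seq) <= min_len:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Any stop codon before the last codon is a premature internal stop.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def iqr_length_filter(seqs: Dict[str, str], k: float = 1.5) -> Dict[str, str]:
    """Drop sequences whose length lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    lengths = [len(s) for s in seqs.values()]
    q1, _, q3 = statistics.quantiles(lengths, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {name: s for name, s in seqs.items() if low <= len(s) <= high}


# Synthetic sequences, for illustration only.
good = "ATG" + "GCT" * 100 + "TAA"
premature = "ATG" + "GCT" * 50 + "TAG" + "GCT" * 50 + "TAA"
no_start = "GCT" * 101 + "TAA"
too_short = "ATG" + "GCT" * 10 + "TAA"
```

Checking frame and length first keeps the codon split well defined; the IQR filter is applied to the whole set afterwards because it depends on the length distribution of all retained sequences.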
Branch-level and site-level selection analyses
To screen for lineage-specific episodic positive selection in Arabidopsis COL genes, a multi-stage HyPhy23,24 workflow was applied. First, an exploratory BUSTED analysis25 was performed using all branches corresponding to A. thaliana COL genes as the designated foreground set. This exploratory step tested whether any Arabidopsis COL lineage exhibited evidence of episodic positive selection when compared against the rest of the Brassicaceae phylogeny. BUSTED identified significant episodic selection affecting at least one site on at least one foreground branch (P < 0.05 after Benjamini–Hochberg correction), motivating further branch-level dissection.
Based on this result, a more fine-grained analysis was conducted using aBSREL26, with the same set of Arabidopsis branches specified as foreground. aBSREL evaluates each foreground branch individually for evidence of episodic diversifying selection. After false-discovery-rate correction, only a single branch—corresponding to the A. thaliana COL5 gene (AT5G57660)—showed statistically significant evidence of episodic positive selection (P < 0.05, FDR-adjusted). No other Arabidopsis COL branches met the significance threshold.
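For readers unfamiliar with false-discovery-rate adjustment, the sketch below implements the Benjamini–Hochberg procedure named above for the BUSTED step. Whether the aBSREL adjustment used exactly this procedure is not stated in the text, so applying it to branch-level P values is an assumption, and the example P values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four tested branches.
raw = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]
```

With these toy values only the first branch remains below 0.05 after adjustment, which mirrors the situation reported here, in which a single branch passes the corrected threshold.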
Given that COL5 was the only positively selected lineage, site-level analyses were restricted to this gene. MEME27 was run with the COL5 branch specified as the foreground to identify individual codon sites subject to episodic diversifying selection within this branch. MEME detected three sites evolving under episodic positive selection in COL5 (FDR-corrected P < 0.05). These MEME-identified sites were subsequently mapped to the reconstructed COL5 ancestral sequence and to the modern A. thaliana COL5 protein to determine the direction and nature of amino-acid changes along the positively selected branch.
Mapping of positively selected sites to ancestral states
MEME results were used solely as a list of candidate sites; the evolutionary direction of change was determined only after ancestral reconstruction. Ancestral sequences were reconstructed in IQ-TREE 3 under the best-fit amino-acid substitution model using the “-asr” option21. Residues with posterior probability ≥ 0.90 were treated as high confidence. To establish precise correspondence between extant and ancestral positions, MEME-detected sites were mapped using a coordinate reconciliation workflow. This workflow aligned trimmed CDS to proteins, aligned trimmed proteins to the ungapped extant A. thaliana protein, and finally aligned the extant protein to the reconstructed ancestor using a Needleman–Wunsch global aligner28.
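The gap bookkeeping that underlies this kind of coordinate reconciliation can be sketched as follows. The two helper functions are hypothetical and cover only the two elementary operations: converting an alignment column to an ungapped residue index, and carrying a residue index from one sequence to another through a pairwise alignment. Columns removed by trimming would need an additional lookup that is not shown, and the toy sequences are not COL5.

```python
from typing import Optional


def column_to_residue(aligned_seq: str, column: int, gap: str = "-") -> Optional[int]:
    """Map a 1-based alignment column to a 1-based ungapped residue index."""
    if not 1 <= column <= len(aligned_seq):
        raise ValueError("column outside alignment")
    if aligned_seq[column - 1] == gap:
        return None  # the sequence has no residue in this column
    return sum(1 for ch in aligned_seq[:column] if ch != gap)


def map_residue(aligned_a: str, aligned_b: str, residue_a: int,
                gap: str = "-") -> Optional[int]:
    """Carry a 1-based residue index from sequence A to sequence B
    through their pairwise alignment (two rows of equal length)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")
    pos_a = pos_b = 0
    for ch_a, ch_b in zip(aligned_a, aligned_b):
        pos_a += ch_a != gap
        pos_b += ch_b != gap
        if ch_a != gap and pos_a == residue_a:
            return pos_b if ch_b != gap else None
    raise ValueError("residue index outside sequence A")


# Toy pairwise alignment, for illustration only.
row_a = "MKT-AG"
row_b = "M-TQAG"
```

Returning `None` for a residue that faces a gap makes unmapped positions explicit, which matters when a selected site in one sequence has no counterpart in the other.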
For each selected site, ancestral and modern amino acids were compared. Only codons showing a non-synonymous change between the ancestor and the extant A. thaliana sequence were considered evolutionarily meaningful. Among three MEME-identified sites, two positions, 274 and 275, exhibited clear derived amino acids in the modern protein, whereas one site, 194, remained unchanged. The two non-synonymous sites were retained for functional reverse-evolution experiments.
Reverse-evolution mutagenesis
To test the functional consequences of adaptive evolution, ancestral amino acids from the reconstructed COL5 ancestor were introduced into the extant A. thaliana COL5 protein at the two positively selected sites, 274 and 275. This “reverse-evolution”6 approach evaluates how the modern protein would behave if the ancestral residues, presumed to represent the pre-adaptive state, were restored.
Site-directed mutagenesis was used to introduce ancestral residues A and T at modern positions 274 (T) and 275 (G) (ancestral positions 228 and 229). The resulting mutant (MUT) construct represents a modern protein carrying ancestral states at adaptively evolving sites, enabling a direct comparison with the wild type (WT) modern COL5 protein.
Rosetta ΔΔG stability analyses
Structural stability of WT and MUT proteins was predicted using PyRosetta29. Models were refined with the Cartesian FastRelax protocol under the ref2015_cart energy function. Each model underwent five independent relaxation cycles, and minimized total energies were recorded in Rosetta Energy Units (REU). Mutational effects were computed as ΔΔG = E_mut − E_WT, with positive values indicating destabilization.
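As a worked illustration of the sign convention, the short sketch below computes ΔΔG from replicate energies. How replicate energies were combined is not stated in the text; taking the difference of replicate means is an assumption of this sketch, and the REU values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def ddg(e_mut: Sequence[float], e_wt: Sequence[float]) -> float:
    """Return mean(E_mut) - mean(E_WT); positive values indicate destabilization."""
    return mean(e_mut) - mean(e_wt)


# Hypothetical minimized total energies in Rosetta Energy Units (REU).
wt_energies = [-1000.0, -990.0]
mut_energies = [-950.0, -940.0]
delta = ddg(mut_energies, wt_energies)
destabilizing = delta > 0
```

Because lower Rosetta energies indicate greater predicted stability, a mutant whose mean energy is less negative than that of the wild type yields a positive ΔΔG.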
Molecular dynamics simulations using OpenMM
Atomistic molecular dynamics simulations were performed with OpenMM30. Systems were parameterized using the AMBER14 force field and TIP3P water, solvated with a 1.0-nm buffer and 0.15 M NaCl. After energy minimization, systems were equilibrated for 50 ps using a Langevin thermostat at 310 K with 2-fs timesteps. Production simulations were performed for 100 ns, saving coordinates every 5,000 steps.
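The simulation settings above imply specific step and frame counts, which can be derived with simple integer arithmetic. The sketch below does only this bookkeeping and does not call OpenMM; the variable and function names are hypothetical.

```python
FS_PER_PS = 1_000
FS_PER_NS = 1_000_000


def n_steps(duration_fs: int, timestep_fs: int) -> int:
    """Number of integration steps needed to cover a duration exactly."""
    if duration_fs % timestep_fs:
        raise ValueError("duration is not a whole number of timesteps")
    return duration_fs // timestep_fs


timestep_fs = 2                                       # 2-fs timestep
equil_steps = n_steps(50 * FS_PER_PS, timestep_fs)    # 50-ps equilibration
prod_steps = n_steps(100 * FS_PER_NS, timestep_fs)    # 100-ns production run
save_interval = 5_000                                 # steps between saved frames
n_frames = prod_steps // save_interval
frame_spacing_ps = save_interval * timestep_fs / FS_PER_PS
```

Under these settings each trajectory contains 10,000 saved frames spaced 10 ps apart, which is the sample size behind the per-frame statistics reported later.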
Trajectory analysis was carried out using MDAnalysis v2 with a memory-efficient streaming workflow. RMSD, cross-RMSD, radius of gyration (Rg) and per-residue Cα RMSF were computed using an online Welford variance accumulator. Global RMSF differences and free-energy distributions were compared via Mann–Whitney U test.
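A streaming RMSF calculation based on Welford's algorithm can be sketched in pure Python as shown below. This is an illustration of the technique, not the analysis script used here: the class name is hypothetical, frames are assumed to be already superposed on a common reference, and coordinates are passed as plain (x, y, z) tuples rather than MDAnalysis objects.

```python
import math
from typing import List, Sequence

Coord = Sequence[float]  # (x, y, z) for one atom


class WelfordRMSF:
    """Accumulate per-atom positional variance one frame at a time."""

    def __init__(self, n_atoms: int) -> None:
        self.count = 0
        self.mean = [[0.0, 0.0, 0.0] for _ in range(n_atoms)]
        self.m2 = [[0.0, 0.0, 0.0] for _ in range(n_atoms)]

    def update(self, frame: Sequence[Coord]) -> None:
        """Fold one frame of coordinates into the running mean and M2 sums."""
        self.count += 1
        for atom, xyz in enumerate(frame):
            for axis in range(3):
                delta = xyz[axis] - self.mean[atom][axis]
                self.mean[atom][axis] += delta / self.count
                # Second factor uses the updated mean (Welford's recurrence).
                self.m2[atom][axis] += delta * (xyz[axis] - self.mean[atom][axis])

    def rmsf(self) -> List[float]:
        """Per-atom RMSF: square root of the summed population variances."""
        if self.count == 0:
            raise ValueError("no frames accumulated")
        return [math.sqrt(sum(m2) / self.count) for m2 in self.m2]


# Two toy frames with two atoms: atom 0 moves along x, atom 1 is fixed.
acc = WelfordRMSF(n_atoms=2)
acc.update([(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)])
acc.update([(2.0, 0.0, 0.0), (5.0, 5.0, 5.0)])
values = acc.rmsf()
```

The accumulator stores only a running mean and a sum of squared deviations per coordinate, so memory use is independent of trajectory length.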
Statistical analyses
All statistical analyses were performed in R.
Results
Identification and curation of 284 CONSTANS-LIKE orthologs across Brassicaceae
A keyword-based search for “CONSTANS-like” genes in the A. thaliana genome (Phytozome v14, Araport11 annotation) initially retrieved twelve candidate loci. Domain-architecture validation using InterProScan revealed that only ten of these possessed both diagnostic features of the CONSTANS-LIKE family, the N-terminal B-box zinc-finger and the C-terminal CCT motif, and these ten were retained as the authentic A. thaliana COL gene set.
These ten curated Arabidopsis sequences were subsequently used as queries in an RBH search across 31 Brassicaceae proteomes and one outgroup (Carica papaya). After applying stringent coverage and domain-based filters to remove partial or truncated matches, a total of 284 unique one-to-one COL orthologs were identified. All retained sequences exhibited the conserved B-box–CCT domain organization characteristic of the COL family, and each species contributed at most a single ortholog per Arabidopsis query gene. These high-confidence ortholog groups formed the basis for all downstream phylogenetic, evolutionary and ancestral-state analyses.
Robust alignment shows no substantial substitution saturation
All 284 curated CONSTANS-LIKE coding sequences passed quality-control filtering, with no sequences removed due to missing start codons, premature stop codons, internal frame disruptions, or aberrant lengths. The final dataset therefore consisted of 284 high-confidence CDS representing all orthologs retained after RBH curation. PRANK-generated protein alignments and PAL2NAL codon back-translations yielded well-aligned datasets with no extensive ambiguously aligned regions, and ClipKIT trimming removed only gap-prone, low-information sites without altering overall alignment structure. Maximum-likelihood phylogenetic reconstruction produced a stable, well-resolved topology consistent with expected Brassicaceae relationships, indicating that the alignment was suitable for downstream evolutionary analyses (Fig. 1).
Fig. 1.
COL gene tree and mapping residues in COL5. (a) Episodic selection on the COL5 lineage (highlighted with red) and (b) mapping of selected residues to extant and reconstructed ancestral sequences.
Substitution saturation was evaluated using Xia’s Iss test implemented in DAMBE, and numerical results were examined explicitly for each taxon subset (4, 8, 16 and 32 OTUs). For all subsets, the observed Iss values (0.431–0.444) were well below the corresponding critical Iss.c values under the symmetrical topology model (Iss.cSym = 0.608–0.786). This indicates a statistically significant difference between Iss and Iss.cSym across all comparisons (T = 3.308–7.088; df = 131; P = 0.0000–0.0012), demonstrating little to no substitution saturation under symmetric assumptions. Under the more conservative asymmetric topology model, the observed Iss values also remained consistently lower than or comparable to the asymmetric critical values (Iss.cAsym = 0.460–0.804). For subsets of 4 and 8 OTUs, Iss < Iss.cAsym with strong significance (T = 4.917–7.454; P = 0.0000). For 16- and 32-OTU subsets, Iss values approached but did not exceed their respective Iss.cAsym cutoffs (e.g., Iss = 0.431 vs Iss.cAsym = 0.460 at 16 OTUs; T = 0.554; P = 0.5804), indicating that even in the largest taxon subsets saturation was not detected. Across all analyses, Iss never exceeded Iss.cSym or Iss.cAsym, and in no case was the null hypothesis of equal saturation rejected in the direction consistent with saturation. Thus, the dataset shows no substantial substitution saturation, confirming that the codon alignment retains robust phylogenetic signal suitable for downstream ML tree building and codon-model selection tests.
HyPhy detected COL5 as the episodically selected branch with three sites under positive selection
An exploratory BUSTED analysis that designated all Arabidopsis CONSTANS-LIKE branches as the foreground returned a significant gene-wide signal (P < 0.05), prompting branch-wise tests with aBSREL. After FDR correction, aBSREL identified only the COL5 branch (AT5G57660) as episodically selected (P = 0.015) and inferred two ω classes, with the majority of sites under purifying selection (ω = 0.2001; 98.887% of sites) and a small fraction under positive selection (ω = 299.3; 1.1129% of sites; mean ω = 3.529, CoV = 8.892). Site-level inference using MEME, run with the COL5 branch as the foreground, detected three codons under episodic diversifying selection, reported at positions 265, 452 and 453 in the trimmed (gapped) multiple-sequence alignment (MSA). Each was characterized by a very large β⁺ estimate (52,353.55, 7,624.62 and 4,929.70, respectively), a significant LRT (LRT = 18.71, 8.11 and 8.26) and a posterior probability p⁺ ≈ 0.99. These MEME-reported (gapped MSA) positions were then mapped to the ungapped extant COL5 protein and the reconstructed COL5 ancestor: MSA position 265 corresponds to extant residue 194 and ancestral residue 143 (H to H), MSA position 452 corresponds to extant residue 274 and ancestral residue 228 (ancestral A to modern T), and MSA position 453 corresponds to extant residue 275 and ancestral residue 229 (ancestral T to modern G) (Table 1; Fig. 1b). Thus, only the two MSA-reported sites at trimmed positions 452 and 453 (extant positions 274 and 275) represent non-synonymous, derived substitutions on the COL5 branch and were selected for subsequent reverse-evolution mutagenesis and comparative biophysical analysis, whereas the MSA-reported site at position 265 is invariant between ancestor and extant sequences and was not pursued experimentally.
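The mean ω and coefficient of variation reported for the COL5 branch follow from the two inferred rate classes. The sketch below recomputes both as a weighted mean and a weighted standard deviation divided by the mean; the function name is hypothetical, and normalizing by the summed weights is an assumption made because the reported proportions are rounded.

```python
import math
from typing import Sequence, Tuple


def mixture_mean_and_cov(classes: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (mean, coefficient of variation) of a discrete omega distribution.

    Each class is (omega, weight); weights are renormalized to sum to 1.
    """
    total = sum(weight for _, weight in classes)
    mean = sum(omega * weight for omega, weight in classes) / total
    variance = sum(weight * (omega - mean) ** 2 for omega, weight in classes) / total
    return mean, math.sqrt(variance) / mean


# Rate classes reported for the COL5 branch: (omega, proportion of sites).
col5_classes = [(0.2001, 0.98887), (299.3, 0.011129)]
mean_omega, cov_omega = mixture_mean_and_cov(col5_classes)
```

Recomputed this way, both quantities agree with the reported values (mean ω = 3.529, CoV = 8.892) to within rounding, showing that a branch-wide mean above 1 is driven entirely by the roughly 1% of sites in the high-ω class.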
Table 1.
Positively selected sites in COL5 and their extant vs ancestral residue identities.
| MEME site (trimmed MSA) | Extant position | Extant AA | Ancestral position | Ancestral AA | Posterior probability | Syn/Non-syn | Note |
|---|---|---|---|---|---|---|---|
| 265 | 194 | H | 143 | H | 1.00 | Synonymous | Not used for mutagenesis |
| 452 | 274 | T | 228 | A | 1.00 | Non-synonymous | Used for mutagenesis |
| 453 | 275 | G | 229 | T | 1.00 | Non-synonymous | Used for mutagenesis |
Fig. 2.

Comparison of Rosetta total energies (REU) between WT and the ancestral-reversion mutant. Boxplots show ten independent Rosetta ΔΔG calculations per variant, with individual replicates overlaid as jittered points. Lower REU values indicate greater predicted stability. Statistical significance was assessed using the Wilcoxon rank-sum test, with the p-value displayed above the plot.
Ancestral sequence reconstruction confirms high-confidence ancestral states at selected sites
Ancestral reconstruction of the COL5 lineage was performed using the amino-acid alignment of 284 sequences comprising 625 sites, of which 560 were parsimony-informative and 65 were constant; ModelFinder identified Q.MAMMAL + I + R5 as the best-fitting substitution model under BIC, and this model was used for maximum-likelihood inference in IQ-TREE. The reconstructed COL5 ancestral sequence was fully resolved across the entire alignment, including the canonical B-box and CCT motifs, which were perfectly conserved and showed no ambiguity in ancestral state assignment. At the two MEME-identified non-synonymous positions used for downstream mutagenesis (ancestral 228 and 229), the inferred amino-acid states were recovered with posterior probability = 1.0, indicating maximal statistical confidence. These unambiguous ancestral states (A at position 228 and T at position 229) provided a robust basis for reverse-evolution mutagenesis, ensuring that the experimentally introduced residues faithfully reflect the ancestral condition prior to the derived substitutions that arose along the positively selected COL5 branch.
Rosetta stability analysis reveals strong destabilization caused by the ancestral double reversion
Rosetta-based stability predictions showed that restoring both ancestral residues simultaneously at positions 274 and 275 (T274A/G275T) produced a consistently and markedly destabilizing effect on the COL5 protein. Across ten independent Cartesian FastRelax replicates (Fig. 2), the double mutant exhibited an average ΔΔG of +50.90 REU relative to the wild-type COL5 model, far exceeding the +0.5 REU threshold typically interpreted as meaningful destabilization. This pervasive increase in predicted energy across all replicates indicates that the ancestral combination of residues at positions 274 and 275, although historically present, is markedly less compatible with the modern COL5 structural background. The results therefore suggest that the derived modern residues at these sites contribute significantly to the stability of the contemporary protein fold and that the ancestral-to-derived substitutions along the positively selected branch of COL5 likely conferred a stabilizing adaptive advantage.
Molecular dynamics simulations reveal widespread structural and energetic disruption caused by the ancestral double reversion
Analysis of the 100-ns atomistic simulations demonstrated that introduction of the ancestral residues at positions 274 and 275 substantially perturbs the dynamical behavior of COL5 (Table 2; Fig. 3). The mutant exhibited markedly elevated self-aligned backbone RMSD (mean = 20.522 Å, SD = 3.492) relative to the wild type (mean = 17.181 Å, SD = 3.239), indicating reduced structural persistence throughout the simulation. Cross-RMSD, computed by mapping the mutant trajectory onto the wild-type conformational ensemble, averaged 27.390 Å (SD = 1.127), showing that the mutant explores structural states far removed from the WT landscape. Despite these large-scale deviations, the mutant adopted a more compact global geometry, with a reduced radius of gyration (77.899 Å) compared to the WT (81.929 Å). Residue-level Cα RMSF values further revealed a pronounced damping of flexibility in the mutant (mean = 73.255 Å) compared with WT (104.386 Å). Energetically, the WT trajectory sampled substantially lower mean potential energy (−7,065,654 kJ/mol) than the mutant (−6,069,270 kJ/mol), indicating a robust energetic penalty associated with the ancestral reversion. Collectively, these simulations show that the double mutant destabilizes the global conformational landscape, reduces flexibility, alters compactness, and incurs a significant energetic cost, providing dynamical support for the strong destabilization predicted by Rosetta.
Table 2.
Summary of global structural and energetic differences between the wild-type (WT) protein and the ancestral double mutant (T274A/G275T) derived from 100-ns molecular dynamics simulations. Global conformational descriptors, including backbone RMSD, cross-RMSD to the WT ensemble, and radius of gyration, are reported descriptively as mean ± standard deviation to characterize differences in conformational behavior. Residue-level RMSF values are presented as descriptive measures of flexibility, as they are derived from single trajectories and do not constitute independent statistical samples. Potential energy distributions were compared cautiously, acknowledging temporal autocorrelation inherent to molecular dynamics trajectories.
| Metric | WT (mean ± SD) | MUT (mean ± SD) | Statistical test | Interpretation |
|---|---|---|---|---|
| Backbone RMSD (Å) | 17.181 ± 3.239 | 20.522 ± 3.492 | Descriptive | Mutant exhibits reduced structural persistence and greater deviation |
| Cross-RMSD to WT ensemble (Å) | – | 27.390 ± 1.127 | Descriptive | Mutant explores conformational space far from WT baseline |
| Radius of gyration (Å) | 81.929 ± 0.119 | 77.899 ± 0.204 | Descriptive | Mutant is more compact than WT |
| Residue-level RMSF (Å) | 104.386 ± 2.121 | 73.255 ± 2.608 | Descriptive | Mutant shows globally dampened flexibility |
| Potential energy (kJ/mol) | −7,065,654 ± 2,548.477 | −6,069,270 ± 2,368.697 | Descriptive | Mutant occupies a higher-energy ensemble |
Fig. 3.
Molecular dynamics analysis of the structural consequences of the ancestral double reversion in COL5. (a) The temporal evolution of the radius of gyration (Rg) for WT and the ancestral-reversion mutant (MUT) over the 100-ns simulation. WT maintained a consistently larger Rg (~ 82 Å), whereas MUT adopted a more compact conformation (~ 78 Å) throughout the trajectory, indicating global structural compaction upon introduction of ancestral residues at positions 274 and 275. (b) Backbone RMSD profiles comparing WT aligned to its own initial structure (WT self) with the mutant trajectory aligned to the WT reference (MUT to WT). WT converged to a stable plateau between 15 and 20 Å, whereas MUT consistently deviated by ~ 25–30 Å from the WT conformational ensemble, demonstrating that the double reversion shifts the structure into a distinct region of conformational space. (c) Self-aligned backbone RMSD of WT and MUT, calculated by aligning each trajectory to its own first frame. Although both systems displayed gradual structural drift, MUT exhibited higher RMSD amplitudes (18–23 Å) than WT (15–20 Å), indicating reduced structural persistence and greater conformational wandering in the revertant background. (d) Per-residue Cα RMSF profiles for WT and MUT across the full 100-ns simulation. MUT showed uniformly lower RMSF values (~ 70–80 Å) relative to WT (~ 100–110 Å), revealing a global reduction in residue-level flexibility. The attenuated mobility across the protein suggests that the ancestral reversion rigidifies the conformational ensemble despite the destabilizing energetic consequences observed in global stability analyses.
Discussion
The present study provides a comprehensive mechanistic demonstration of how episodic positive selection shaped the evolution of the Arabidopsis COL5 protein. The analysis was built upon a high-resolution phylogenomic dataset comprising 284 Brassicaceae orthologs, free of substitution saturation and supported by a fully resolved maximum-likelihood topology, ensuring that subsequent evolutionary inference was drawn from a robust comparative framework. Within this framework, COL5 was confirmed as a lineage exhibiting episodic diversifying selection, in agreement with an earlier pilot study that initially suggested COL5 as an evolutionary outlier within the Arabidopsis COL family4. The present work expands substantially beyond those preliminary observations.
Two of the three sites detected under episodic selection were found to correspond to genuine historical amino-acid replacements that arose along the COL5 branch. Ancestral reconstruction assigned these historical states with posterior probability 1.0, permitting unambiguous determination of the evolutionary direction of change. Their reintroduction into the modern COL5 protein therefore provided a controlled approach for assessing the functional relevance of the selected residues. The biophysical analyses revealed that the ancestral double reversion produced a large and reproducible energetic penalty, with Rosetta ΔΔG values exceeding +50 REU on average and reaching +85 REU in individual replicates. These findings demonstrate that the modern COL5 fold has become strongly dependent on the derived residues at positions 274 and 275, and that the ancestral configuration is no longer structurally compatible. Such behaviour is characteristic of evolutionary entrenchment, in which adaptive substitutions become locked into place by subsequent structural or functional refinements.
The destabilizing effect of the ancestral residues was further clarified by molecular dynamics simulations, which showed that the mutant adopts conformational states far removed from the WT ensemble, exhibits markedly elevated RMSD values, and samples an energetically less favourable landscape. The global damping of residue-level mobility and the reduced radius of gyration indicate that the ancestral mutant collapses into a more constrained and frustrated conformational ensemble. Such an ensemble is likely to be structurally, energetically, and potentially functionally suboptimal, particularly because COL5 is a transcription factor that requires substantial conformational flexibility to remain biologically functional.
These results provide atomistic support for the conclusion that the adaptive substitutions introduced along the COL5 branch helped shape a more stable and dynamically coherent protein. This stabilizing effect contrasts with the behaviour inferred from earlier, more limited analyses in the pilot study, in which only a single substitution was assessed; the present work shows that the combined action of multiple historically selected residues has produced a substantially more pronounced structural shift.
Conclusion
This study demonstrates that the CONSTANS-like gene COL5 in A. thaliana underwent a discrete episode of adaptive evolution that resulted in two derived amino-acid substitutions with lasting structural consequences. Using a high-quality, saturation-free Brassicaceae-wide ortholog dataset and rigorous HyPhy analyses, the study pinpointed these substitutions to a single positively selected branch and mapped them unambiguously to their ancestral states. Reverse-evolution mutagenesis revealed that restoring the ancestral residues severely destabilizes the contemporary COL5 protein, as evidenced by large energetic penalties in Rosetta and profound perturbations in conformational behavior during molecular dynamics simulations. These findings show that the derived residues are now structurally entrenched within the modern protein fold and that the adaptive substitutions likely conferred stabilizing advantages that reshaped the COL5 energy landscape. More broadly, the work illustrates how episodic positive selection can drive functionally consequential structural transitions and how ancestral sequence reconstruction, evolutionary inference and atomistic modeling can be integrated to uncover the mechanistic legacy of adaptive molecular evolution.
Acknowledgements
I thank the global open-access community for its dedication to free science and for breaking down barriers to knowledge; without these efforts, this study would not have been possible. I appreciate those who work to democratize information, driving progress and empowerment worldwide. I also acknowledge the freely available language models that helped refine this manuscript. Finally, I am deeply grateful to my wife, Dr. Nabanita Ghosh, Assistant Professor of Zoology at Maulana Azad College, Kolkata, for her insightful suggestions.
Author contributions
K.S. conceived the idea, collected the data, analyzed the data, wrote the main and revised manuscript texts, and prepared the tables and figures.
Data availability
All data, computational analyses, and associated materials generated in this study are publicly available at Zenodo under the DOI: 10.5281/zenodo.17682440.
Declarations
Competing interests
The author declares no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Khatun, K. et al. Genome-wide identification, genomic organization, and expression profiling of the CONSTANS-like (COL) gene family in petunia under multiple stresses. BMC Genom. 22, 1–17 (2021).
- 2. Ma, S. et al. Identification and characterization of the CONSTANS-like gene family and its expression profiling under salt treatment in alfalfa (Medicago sativa L.). Plant Gene 44, 100544 (2025).
- 3. Griffiths, S., Dunford, R. P., Coupland, G. & Laurie, D. A. The evolution of CONSTANS-like gene families in Barley, Rice, and Arabidopsis. Plant Physiol. 131, 1855 (2003).
- 4. Sinha, K. Episodic positive selection signatures in Arabidopsis CONSTANS-like genes COL3 and COL5 indicating adaptive evolution in red-light signaling pathways. bioRxiv 2025.04.03.646976. 10.1101/2025.04.03.646976 (2025).
- 5. Hobbs, J. K., Prentice, E. J., Groussin, M. & Arcus, V. L. Reconstructed ancestral enzymes impose a fitness cost upon modern bacteria despite exhibiting favourable biochemical properties. J. Mol. Evol. 81, 110–120 (2015).
- 6. Kaltenbach, M., Jackson, C. J., Campbell, E. C., Hollfelder, F. & Tokuriki, N. Reverse evolution leads to genotypic incompatibility despite functional and active site convergence. Elife 4, e06492 (2015).
- 7. Hochberg, G. K. A. & Thornton, J. W. Reconstructing ancient proteins to understand the causes of structure and function. Annu. Rev. Biophys. 46, 247–269 (2017).
- 8. Risso, V. A. et al. Mutational studies on resurrected ancestral proteins reveal conservation of site-specific amino acid preferences throughout evolutionary history. Mol. Biol. Evol. 32, 440 (2014).
- 9. Bloom, J. D. & Arnold, F. H. In the light of directed evolution: Pathways of adaptive protein evolution. Proc. Natl. Acad. Sci. U. S. A. 106(Suppl 1), 9995–10000 (2009).
- 10. Goodstein, D. M. et al. Phytozome: A comparative platform for green plant genomics. Nucleic Acids Res. 40, D1178–D1186 (2012).
- 11. Jones, P. et al. InterProScan 5: Genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
- 12. Moreno-Hagelsieb, G. & Latimer, K. Choosing BLAST options for better detection of orthologs as reciprocal best hits. Bioinformatics 24, 319–324 (2008).
- 13. Hernández-Salmerón, J. E. & Moreno-Hagelsieb, G. Progress in quickly finding orthologs as reciprocal best hits: Comparing blast, last, diamond and MMseqs2. BMC Genomics 21(1), 741 (2020).
- 14. Cock, P. J. A. et al. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
- 15. Löytynoja, A. Phylogeny-aware alignment with PRANK and PAGAN. Methods Mol. Biol. 2231, 17–37 (2021).
- 16. Suyama, M., Torrents, D. & Bork, P. PAL2NAL: Robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34, W609 (2006).
- 17. Larsson, A. AliView: A fast and lightweight alignment viewer and editor for large datasets. Bioinformatics 30, 3276–3278 (2014).
- 18. Steenwyk, J. L., Buida, T. J., Li, Y., Shen, X. X. & Rokas, A. ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference. PLoS Biol. 18, e3001007 (2020).
- 19. Xia, X. DAMBE5: A comprehensive software package for data analysis in molecular biology and evolution. Mol. Biol. Evol. 30, 1720 (2013).
- 20. Xia, X., Xie, Z., Salemi, M., Chen, L. & Wang, Y. An index of substitution saturation and its application. Mol. Phylogenet. Evol. 26, 1–7 (2003).
- 21. Wong, T. K. F. et al. IQ-TREE 3: Phylogenomic Inference Software using Complex Evolutionary Models. 10.32942/X2P62N (2025).
- 22. Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., Von Haeseler, A. & Jermiin, L. S. ModelFinder: Fast model selection for accurate phylogenetic estimates. Nat. Methods 14(6), 587–589 (2017).
- 23. Kosakovsky Pond, S. L. et al. HyPhy 2.5—A customizable platform for evolutionary hypothesis testing using phylogenies. Mol. Biol. Evol. 37, 295 (2019).
- 24. Weaver, S. et al. Datamonkey 2.0: A modern web application for characterizing selective and other evolutionary processes. Mol. Biol. Evol. 35, 773–777 (2018).
- 25. Murrell, B. et al. Gene-wide identification of episodic selection. Mol. Biol. Evol. 32, 1365 (2015).
- 26. Smith, M. D. et al. Less is more: An adaptive branch-site random effects model for efficient detection of episodic diversifying selection. Mol. Biol. Evol. 32, 1342 (2015).
- 27. Murrell, B. et al. Detecting individual sites subject to episodic diversifying selection. PLoS Genet. 8, e1002764 (2012).
- 28. Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970).
- 29. Sora, V. et al. RosettaDDGPrediction for high-throughput mutational scans: From stability to binding. Protein Sci. 32, e4527 (2023).
- 30. Eastman, P. et al. OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLoS Comput. Biol. 13, e1005659 (2017).


